About the book ...


Like its predecessor, the second edi- 
tion of An Introduction to Probability 
Theory and Its Applications serves a 
dual purpose. It develops probability 
theory rigorously as a mathematical 
discipline, and, at the same time, illus- 
trates the broad variety of practical 
problems with modern techniques used 
in their solution. For his illustrative
material and examples, the author 
draws from a great many fields, in- 
cluding engineering, genetics, physics, 
and statistics. He includes new results 
concerning fluctuation theory devel- 
oped by elementary methods. 


In addition to new material in many 
of the chapters, the text includes two 
new chapters. One of these covers by 
elementary methods surprising phe- 
nomena of random walks and general 
fluctuation theory. An extended treat- 
ment of compound distributions and 
branching processes is offered in the 
other new chapter. 


In order to present the intuitive 
background, and the basic concepts of 
probability theory unhampered by 
analytical formalism, the first volume 
is restricted to discrete sample spaces. 
The restricted coverage allows the 
author to treat many typical problems 
in great detail and to explain the 
probabilistic approach to them. 
Preface to the Second Edition


THE FAVORABLE RECEPTION OF THE FIRST EDITION SURPASSED 
the most daring anticipation and, in addition to an unexpected num- 
ber of users, the book seems to have found friends who read it merely 
for fun; it is most heartening that they range from pure mathematicians 
to pure amateurs. Although I cannot here express individual thanks to 
all readers to whom I am indebted for useful critical comments, their 
communications stimulated me during six years to think of improvements
and to collect better examples and exercises. I hope that these
will make for easier reading and teaching from the book. 

The general plan, as described in the preface to the first edition, re- 
mains unchanged. To accommodate the manifold needs of readers 
with divergent backgrounds, interests, and degrees of mathematical 
sophistication, it was necessary frequently to deviate from the main 
path. The exposition therefore does not always progress from the easy 
to the difficult; comparatively technical sections appear at the begin- 
ning and easy sections in chapters XV and XVII. Inexperienced 
readers should not attempt to follow many side lines lest they lose 
sight of the forest for too many trees. To facilitate orientation and 
the choice of desirable omissions, stars are used more systematically 
than in the first edition. The unstarred sections form a self-contained 
whole in which the starred sections are not used. 

A first introduction to the basic notions of probability is contained in
chapters I, V, VI, IX; beginners should cover these with as few digressions
as possible. Chapter II is designed to develop the student’s technique 
and probabilistic intuition; some experience in its contents is desirable, 
but it is not necessary to cover the chapter systematically: it may 
prove more profitable to return to the elementary illustrations as occa- 
sion arises at later stages. For the purposes of a first introduction, the 
restriction to discrete distributions should not be a serious handicap 
since the elementary theory of continuous distributions requires only 
a few words of supplementary explanation. 

From chapter IX an introductory course may proceed directly to 
chapter XI, considering generating functions as an example of more 
general transforms. Chapter XI should be followed by some applications
in chapters XIII (recurrent events) or XII (chain reactions, infinitely
divisible distributions). Without generating functions it is
possible to turn in one of the following directions: limit theorems and 
fluctuation theory (chapters VIII, X, III); stochastic processes (chapter
XVII); random walks (chapter III and the main part of XIV).
These chapters are almost independent of each other. The Markov 
chains of chapter XV depend conceptually on recurrent events, but 
they may be studied independently if the reader is willing to accept 
without proof the basic ergodic theorem. 


Space saved by streamlining made it possible to add new material 
and to integrate the old third chapter with chapter II. New emphasis
is laid on waiting times, a topic now serving as a unifying thread 
throughout the book. This emphasis is reflected in the early intro- 
duction of waiting times in chapter II and in the several independent 
treatments of the first-passage times in random walks. 

Chapter III is entirely new. It illustrates the power of combinatorial 
methods by deriving in an elementary way important results previously 
obtained by advanced analytical tools. The results concerning fluctua- 
tions in coin tossing show that widely held beliefs about the law of 
large numbers are fallacious. These results are so amazing and so at 
variance with common intuition that even sophisticated colleagues 
doubted that coins actually misbehave as theory predicts. The record
of a simulated experiment is therefore included in section 7. 

A new stress on the essential unity of recurrent events and Markov 
chains permitted improvements and simplifications, but at the cost of 
a change from the terminology of the first edition. I am deeply apolo- 
getic for the confusion which is bound to ensue. 

Great care has been taken to render the index usable, but it cannot 
serve as a Who’s Who in probability: the proper balance is destroyed 
by references to all papers that chanced to lead, often indirectly, to 
the construction of an example or exercise. I regret that sometimes 
important contributions are quoted in an irrelevant context not indica- 
tive of their value. 

This edition was prepared under ideal working conditions without 
interruptions by routine duties. For this ease I must thank the Air 
Force Office of Scientific Research, Princeton University, and the 
stimulating hospitality of J. Wolfowitz. I have continued to benefit 
from the helpful criticism of J. L. Doob. The careful checking of 
manuscript and proofs by my wife has removed many errors and effects 
of chance. 

WILLIAM FELLER 

August 1957 


Preface to the First Edition


IT WAS THE AUTHOR’S ORIGINAL INTENTION TO WRITE A BOOK ON 
analytical methods in probability theory in which the latter was to be 
treated as a topic in pure mathematics. Such a treatment would have 
been more uniform and hence more satisfactory from an aesthetic 
point of view; it would also have been more appealing to pure mathe- 
maticians. However, the generous support by the Office of Naval Re- 
search of work in probability theory at Cornell University led the 
author to a more ambitious and less thankful undertaking of satisfying 
heterogeneous needs. 

It is the purpose of this book to treat probability theory as a self- 
contained mathematical subject rigorously, avoiding non-mathematical 
concepts. At the same time, the book tries to describe the empirical 
background and to develop a feeling for the great variety of practical 
applications. This purpose is served by many special problems, nu- 
merical estimates, and examples which interrupt the main flow of the 
text. They are clearly set apart in print and are treated in a more 
picturesque language and with less formality. A number of special 
topics have been included in order to exhibit the power of general 
methods and to increase the usefulness of the book to specialists in 
various fields. To facilitate reading, detours from the main path are 
indicated by stars. The knowledge of starred sections is not assumed 
in the remainder. 

A serious attempt has been made to unify methods. The specialist 
will find many simplifications of existing proofs and also new results. 
In particular, the theory of recurrent events has been developed for the 
purpose of this book. It leads to a new treatment of Markov chains 
which permits simplification even in the finite case. 

The examples are accompanied by about 340 problems mostly with 
complete solutions. Some of them are simple exercises, but most of 
them serve as additional illustrative material to the text or contain 
various complements. One purpose of the examples and problems is 
to develop the reader’s intuition and art of probabilistic formulation. 
Several previously treated examples show that apparently difficult 
problems may become almost trite once they are formulated in a
natural way and put into the proper context. 

There is a tendency in teaching to reduce probability problems to 
pure analysis as soon as possible and to forget the specific character- 
istics of probability theory itself. Such treatments are based on a 
poorly defined notion of random variables usually introduced at the 
outset. This book goes to the other extreme and dwells on the notion 
of sample space, without which random variables remain an artifice. 

In order to present the true background unhampered by measura- 
bility questions and other purely analytic difficulties this volume is 
restricted to discrete sample spaces. This restriction is severe, but 
should be welcome to non-mathematical users. It permits the inclusion 
of special topics which are not easily accessible in the literature. At 
the same time, this arrangement makes it possible to begin in an ele- 
mentary way and yet to include a fairly exhaustive treatment of such 
advanced topics as random walks and Markov chains. The general 
theory of random variables and their distributions, limit theorems, 
diffusion theory, etc., is deferred to a succeeding volume. 

This book would not have been written without the support of the 
Office of Naval Research. One consequence of this support was a 
fairly regular personal contact with J. L. Doob, whose constant criti- 
cism and encouragement were invaluable. To him go my foremost 
thanks. The next thanks for help are due to John Riordan, who 
followed the manuscript through two versions. Numerous corrections 
and improvements were suggested by my wife who read both the manu- 
script and proof. 

The author is also indebted to K. L. Chung, M. Donsker, and 
S. Goldberg, who read the manuscript and corrected various mistakes; 
the solutions to the majority of the problems were prepared by S. 
Goldberg. Finally, thanks are due to Kathryn Hollenbach for patient 
and expert typing help; to E. Elyash, W. Hoffman, and J. R. Kinney 
for help in proofreading. 

WILLIAM FELLER 

Cornell University 

January 1950 
INTRODUCTION


The Nature 
of Probability Theory 


1. THE BACKGROUND 


Probability is a mathematical discipline with aims akin to those, 
for example, of geometry or analytical mechanics. In each field we 
must carefully distinguish three aspects of the theory: (a) the formal 
logical content, (b) the intuitive background, (c) the applications.
The character, and the charm, of the whole structure cannot be ap- 
preciated without considering all three aspects in their proper relation. 


(a) Formal Logical Content 


Axiomatically, mathematics is concerned solely with relations among 
undefined things. This property is well illustrated by the game of 
chess. It is impossible to “define” chess otherwise than by stating a 
set of rules. The conventional shape of the pieces may be described 
to some extent, but it will not always be obvious which piece is in- 
tended for “king.” The chessboard and the pieces are helpful, but 
they can be dispensed with. The essential thing is to know how the 
pieces move and act. It is meaningless to talk about the “definition”
or the “true nature” of a pawn or a king. Similarly, geometry does 
not care what a point and a straight line “really are.” They remain 
undefined notions, and the axioms of geometry specify the relations 
among them: two points determine a line, etc. These are the rules, 
and there is nothing sacred about them. We change the axioms to 
study different forms of geometry, and the logical structure of the 
several non-Euclidean geometries is independent of their relation to 
reality. Physicists have studied the motion of bodies under laws of 
attraction different from Newton’s, and such studies are meaningful 
even if Newton’s law of attraction is accepted as true in nature. 


(b) Intuitive Background 


In contrast to chess, the axioms of geometry and of mechanics refer 
to an existing intuitive background. In fact, geometrical intuition is 
so strong that it is prone to run ahead of logical reasoning. The extent 
to which logic, intuition, and physical experience are interdependent is 
a problem into which we need not enter. Certainly intuition can be 
trained and developed. The bewildered novice in chess moves cau- 
tiously, recalling individual rules, whereas the experienced player ab- 
sorbs a complicated situation at a glance and is unable to account ra- 
tionally for his intuition. In like manner mathematical intuition grows 
with experience, and it is possible to develop a natural feeling for concepts
such as a four-dimensional space.

Even the collective intuition of mankind appears to progress. Newton’s
notions of a field of force and of action at a distance and Maxwell’s
concept of electromagnetic “waves” were at first decried as
“unthinkable” and “contrary to intuition.” Modern technology and radio
in the homes have popularized these notions to such an extent that 
they form part of the ordinary vocabulary. Similarly, the modern 
student has no appreciation of the modes of thinking, the prejudices, 
and other difficulties against which the theory of probability had to 
struggle when it was new. Nowadays newspapers report on samples 
of public opinion, and the magic of statistics embraces all phases of life 
to the extent that young girls watch the statistics of their chances to 
get married. Thus everyone has acquired a feeling for the meaning of 
statements such as “the chances are three in five.” Vague as it is,
this intuition serves as background and guide for the first step. It will 
be developed as the theory progresses and acquaintance is made with 
more sophisticated applications. 


(c) Applications 

The concepts of geometry and mechanics are in practice identified 
with certain physical objects, but the process is so flexible and variable 
that no general rules can be given. The notion of a rigid body is 
fundamental and useful, even though no physical object is rigid. 
Whether a given body can be treated as if it were rigid depends on 
the circumstances and the desired degree of approximation. Rubber 
is certainly not rigid, but in discussing the motion of automobiles text- 
books treat the rubber tires as rigid bodies. Depending on the purpose 
of the theory, we disregard the atomic structure of matter and treat 
the sun now as a ball of continuous matter, now as a single mass point. 

In applications, the abstract mathematical models serve as tools, 
and different models can describe the same empirical situation. The 
manner in which mathematical theories are applied does not depend on
preconceived ideas; it is a purposeful technique depending on, and changing
with, experience. A philosophical analysis of such techniques is a legiti- 
mate study, but it is not within the realm of mathematics, physics, or 
statistics. The philosophy of the foundations of probability must be 
divorced from mathematics and statistics, exactly as the discussion of 
our intuitive space concept is now divorced from geometry. 


2. PROCEDURE 


The history of probability (and of mathematics in general) shows a 
stimulating interplay of theory and applications; theoretical progress 
opens new fields of applications, and in turn applications lead to new 
problems and fruitful research. The theory of probability is now ap- 
plied in many diverse fields, and we require the flexibility of a general 
theory to provide appropriate tools for so great a variety of needs. 
We must therefore withstand the temptation (and the pressure) to 
build the theory, its terminology, and its arsenal too close to one par- 
ticular sphere of interest. We wish instead to develop a mathematical 
theory in the established way which has proved so successful in geom- 
etry and mechanics. 

We shall start from the simplest experiences such as tossing a coin 
or throwing dice, where all statements have an obvious intuitive mean- 
ing. This intuition will be translated into an abstract model to be 
generalized gradually and by degrees. Illustrative examples will be
provided to explain the empirical background of the several models 
and to develop the reader’s intuition, but the theory itself will be of a 
mathematical character. We shall no more attempt to explain the
“true meaning” of probability than the modern physicist dwells on the 
“real meaning” of mass and energy or the geometer discusses the nature 
of a point. Instead, we shall prove theorems and show how they are 
applied. 

At the outset the purpose of the theory of probability was to describe 
the exceedingly narrow domain of experience connected with games of 
chance, the main effort being directed to the calculation of certain 
probabilities. In the opening chapters we too shall calculate a few 
typical probabilities, but it should be borne in mind that numerical 
probabilities are not the principal object of the theory. Its aim is to 
discover general laws and to construct satisfactory theoretical models. 

Probabilities play for us the same role as masses in mechanics. The 
motion of the planetary system can be discussed without knowledge 
of the individual masses and without contemplating methods for their 
actual measurements. Even non-existent planetary systems may be 
the object of a profitable and illuminating study. Similarly, practical
and useful probability models may refer to non-observable worlds. For 
example, billions of dollars have been invested in automatic telephone 
exchanges. These are based on simple probability models in which 
various possible systems are compared. The theoretically best system 
is built and the others will never exist. In insurance, probability 
theory is used to calculate the probability of ruin; that is, the theory 
is used to avoid certain undesirable situations, and consequently it ap- 
plies to situations that are not actually observed. Probability theory 
would be effective and useful even if not a single numerical value were
accessible.


3. “STATISTICAL” PROBABILITY


The success of the modern mathematical theory of probability is 
bought at a price: the theory is limited to one particular aspect of 
“chance.” The intuitive notion of probability is connected with inductive
reasoning and with judgments such as “Paul is probably a
happy man,” “Probably this book will be a failure,” “Fermat’s conjecture
is probably false.” Judgments of this sort are of interest to
the philosopher and the logician, and they are a legitimate object of a 
mathematical theory.¹ It must be understood, however, that we are
concerned not with modes of inductive reasoning but with something 
that might be called physical or statistical probability. In a rough way 
we may characterize this concept by saying that our probabilities do 
not refer to judgments but to possible outcomes of a conceptual experi- 
ment. Before we speak of probabilities, we must agree on an idealized 
model of a particular conceptual experiment such as tossing a coin, 
sampling kangaroos on the moon, observing a particle under diffusion, 
counting the number of telephone calls. At the outset we must agree 
on the possible outcomes of this experiment (our sample space) and the 
probabilities associated with them. This is analogous to the procedure 
in mechanics where fictitious models involving two, three, or seventeen 
mass points are introduced, these points being devoid of individual 
properties. Similarly, in analyzing the coin tossing game we are not 
concerned with the accidental circumstances of an actual experiment: 
the objects of our theory are sequences (or arrangements) of symbols
such as “head, head, tail, head, ....” There is no place in our system
for speculations concerning the probability that the sun will rise to- 
morrow. Before speaking of it we should have to agree on an (idealized) 


¹ B. O. Koopman, The axioms and algebra of intuitive probability, Annals of
Mathematics (2), vol. 41 (1940), pp. 269-292, and The bases of probability, Bulletin 
of the American Mathematical Society, vol. 46 (1940), pp. 763-774. 
model which would presumably run along the lines “out of infinitely
many worlds one is selected at random....” Little imagination is
required to construct such a model, but it appears both uninteresting 
and meaningless. 

The astronomer speaks of measuring the temperature at the center 
of the sun or of travel to Sirius. These operations seem impossible, 
and yet it is not senseless to contemplate them. By the same token, 
we shall not worry whether or not our conceptual experiments can be 
performed; we shall analyze abstract models. In the back of our minds 
we keep an intuitive interpretation of probability which gains opera- 
tional meaning in certain applications. We imagine the experiment 
performed a great many times. An event with probability 0.6 should 
be expected, in the long run, to occur sixty times out of a hundred. 
This description is deliberately vague but supplies a picturesque intui- 
tive background sufficient for the more elementary applications. As 
the theory proceeds and grows more elaborate, the operational meaning 
and the intuitive picture will become more concrete. 
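
The long-run reading of a probability can be illustrated by a small simulation. The following is only a sketch under stated assumptions: the function name `relative_frequency`, the fixed seed, and the trial counts are illustrative choices, and a pseudo-random generator stands in for the idealized experiment.

```python
import random


def relative_frequency(p, trials, seed=0):
    """Fraction of `trials` repetitions in which an event of probability p occurs.

    Assumption: each repetition is modelled by one draw from a seeded
    pseudo-random generator; the event "occurs" when the draw falls below p.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if rng.random() < p)
    return hits / trials


# The observed fraction should settle near 0.6 as the number of trials grows.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(0.6, n))
```

For a hundred trials the fraction may differ noticeably from 0.6; the statement "sixty times out of a hundred" describes only what is expected in the long run.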


4. SUMMARY 


We shall be concerned with theoretical models in which probabilities 
enter as free parameters in much the same way as masses in mechanics. 
They are applied in many and variable ways. The technique of appli- 
cations and the intuition develop with the theory. 

This is the standard procedure accepted and fruitful in other mathe- 
matical disciplines. No alternative has been devised which could con- 
ceivably fill the manifold needs and requirements of all branches of the 
growing entity called probability theory and its applications. 

We may fairly lament that intuitive probability is insufficient for 
scientific purposes, but it is a historical fact. In example I(6.b), we 
shall discuss random distributions of particles in compartments. The 
appropriate, or “natural,” probability distribution seemed perfectly 
clear to everyone and had been accepted without hesitation by physi- 
cists. It turned out, however, that physical particles are not trained 
in human common sense, and the “natural” (or Boltzmann) distribu- 
tion had to be given up for the Einstein-Bose distribution in some 
cases, for the Fermi-Dirac distribution in others. No intuitive argu- 
ment has been offered why photons should behave differently from 
protons and why they do not obey the “a priori” laws. If a justifica- 
tion could now be found, it would only show that intuition develops 
with theory. At any rate, even for applications freedom and flexibility 
are essential, and it would be pernicious to fetter the theory to fixed 
poles. 
It has also been claimed that the modern theory of probability is too
abstract and too general to be useful. This is the battle cry once 
raised by practical-minded people against Maxwell’s field theory. The 
argument could be countered by pointing to the unexpected new appli- 
cations opened by the abstract theory of stochastic processes or to the 
new insights offered by the modern fluctuation theory which once more 
belies intuition and is leading to a revision of practical attitudes. 
However, the discussion is useless; it is too easy to condemn. Only 
yesterday the practical things of today were decried as impractical, 
and the theories which will be practical tomorrow will always be 
branded as valueless games by the practical men of today. 


5. HISTORICAL NOTE 


The statistical, or empirical, attitude toward probability has been 
developed mainly by R. A. Fisher and R. von Mises. The notion of 
sample space² comes from von Mises. This notion made it possible to
build up a strictly mathematical theory of probability based on meas- 
ure theory. Such an approach has emerged gradually in the ’twenties 
under the influence of many authors. An axiomatic treatment representing
the modern development was given by A. Kolmogorov.³ We
shall follow this line, but the term axiom appears too solemn inasmuch 
as the present volume deals only with the simple case of discrete proba- 
bilities. 


² See his book, Wahrscheinlichkeitsrechnung, Leipzig and Wien, 1931, with references
to his original papers dating back to about 1921. The German word is
Merkmalraum (label space). 

³ A. Kolmogoroff, Grundbegriffe der Wahrscheinlichkeitsrechnung, fasc. 3 of vol.
2 of Ergebnisse der Mathematik, Berlin, 1933. 


CHAPTER I 


The Sample Space 


1. THE EMPIRICAL BACKGROUND 


The mathematical theory of probability gains practical value and an 
intuitive meaning in connection with real or conceptual experiments 
such as tossing a coin once, tossing a coin 100 times, throwing three 
dice, arranging a deck of cards, matching two decks of cards, playing 
roulette, observing the life span of a radioactive atom or a person, 
selecting a random sample of people and observing the number of left- 
handers in it, crossing two species of plants and observing the pheno- 
types of the offspring; or with phenomena such as the sex of a newborn 
baby, the number of busy trunklines in a telephone exchange, the 
number of calls on a telephone, random noise in an electrical com- 
munication system, routine quality control of a production process, 
frequency of accidents, the number of double stars in a region of the 
skies, the position of a particle under diffusion. All these descriptions 
are rather vague, and, in order to render the theory meaningful, we 
have to agree on what we mean by possible results of the experiment or 
observation in question. 

When a coin is tossed, it does not necessarily fall heads or tails; it 
can roll away or stand on its edge. Nevertheless, we shall agree to 
regard “head” and “tail” as the only possible outcomes of the experiment.
This convention simplifies the theory without affecting its applicability.
Idealizations of this type are standard practice. It is impossible
to measure the life span of an atom or a person without some
error, but for theoretical purposes it is expedient to imagine that these 
quantities are exact numbers. The question then arises as to which 
numbers can actually represent the life span of a person. Is there a 
maximal age beyond which life is impossible, or is any age conceivable? 
We hesitate to admit that man can grow 1000 years old, and yet cur- 
rent actuarial practice admits no bounds to the possible duration of 
life. According to formulas on which modern mortality tables are 
based, the proportion of men surviving 1000 years is of the order of
magnitude of one in $10^{10^{36}}$, a number with $10^{27}$ billions of zeros. This
statement does not make sense from a biological or sociological point
of view, but considered exclusively from a statistical standpoint it certainly
does not contradict any experience. There are fewer than $10^{10}$
people born in a century. To test the contention statistically, more
than $10^{10^{35}}$ centuries would be required, which is considerably more
than $10^{10^{34}}$ lifetimes of the earth. Obviously, such extremely small
probabilities are compatible with our notion of impossibility. Their 
use may appear utterly absurd, but it does no harm and is convenient 
in simplifying many formulas. Moreover, if we were seriously to dis- 
card the possibility of living 1000 years, we should have to accept the 
existence of a maximum age, and the assumption that it should be pos- 
sible to live x years and impossible to live x years and two seconds is 
as unappealing as the idea of unlimited life. 

Any theory necessarily involves idealization, and our first idealization
concerns the possible outcomes of an “experiment” or “observation.”
If we want to construct an abstract model, we must at the
outset reach a decision about what constitutes a possible outcome of 
the (idealized) experiment. 

For uniform terminology, the results of experiments or observations 
will be called events. Thus we shall speak of the event that of five coins 
tossed more than three fell heads. Similarly, the “experiment” of distributing
the cards in bridge¹ may result in the “event” that North
has two aces. The composition of a sample (“two left-handers in a
sample of 85”) and the result of a measurement (“temperature 120°,”
“seven trunklines busy”) will each be called an event.

We shall distinguish between compound (or decomposable) and simple 
(or indecomposable) events. For example, saying that a throw with 
two dice resulted in “sum six” amounts to saying that it resulted in
“(1, 5) or (2, 4) or (3, 3) or (4, 2) or (5, 1),” and this enumeration decomposes
the event “sum six” into five simple events. Similarly, the
event “two odd faces” admits of the decomposition “(1, 1) or (1, 3) or
... or (5, 5)” into nine simple events. Note that if a throw results in


¹ Definition of bridge and poker. A deck of bridge cards consists of 52 cards arranged
in four suits of thirteen each. There are thirteen face values (2, 3, ..., 10,
jack, queen, king, ace) in each suit. The four suits are called spades, clubs, hearts,
diamonds. The last two are red, the first two black. Cards of the same face
value are called of the same kind. For our purposes, playing bridge means distributing
the cards to four players, to be called North, South, East, and West (or
N, S, E, W, for short) so that each receives thirteen cards. Playing poker, by definition,
means selecting five cards out of the pack.
(3, 3), then the same throw results also in the events “sum six” and
“two odd faces”; these events are not mutually exclusive and hence
may occur simultaneously. As a second example consider the age of 
a person. Every particular value x represents a simple event, whereas 
the statement that a person is in his fifties describes the compound 
event that x lies between 50 and 60. In this way every compound 
event can be decomposed into simple events, that is to say, a com- 
pound event is an aggregate of certain simple events. 
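
The decomposition of the two dice events into simple events can be checked by direct enumeration. This is a minimal sketch; the variable names `throws`, `sum_six`, `two_odd`, and `both` are illustrative, and each throw is represented as an ordered pair of faces.

```python
from itertools import product

# All 36 ordered outcomes of a throw with two dice: the simple events.
throws = list(product(range(1, 7), repeat=2))

# Compound event "sum six" as the aggregate of its simple events.
sum_six = [t for t in throws if t[0] + t[1] == 6]

# Compound event "two odd faces".
two_odd = [t for t in throws if t[0] % 2 == 1 and t[1] % 2 == 1]

# Simple events belonging to both compound events.
both = [t for t in sum_six if t in two_odd]

print(sum_six)       # [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]
print(len(two_odd))  # 9
print(both)          # [(1, 5), (3, 3), (5, 1)]
```

The last line shows that the two events share sample points, which is why they are not mutually exclusive.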

If we want to speak about “experiments” or “observations” in a 
theoretical way and without ambiguity, we must first agree on the 
simple events representing the thinkable outcomes; they define the ideal- 
ized experiment. It is usual to refer to these simple events as sample 
points, or points for short. By definition, every indecomposable result of 
the (idealized) experiment is represented by one, and only one, sample
point. The aggregate of all sample points will be called the sample 
space. All events connected with a given (idealized) experiment can 
be described in terms of sample points. 

Before formalizing these basic conventions, we proceed to discuss a 
few typical examples which will play a role further on. 


2. EXAMPLES 


(a) Distribution of three balls in three cells. Table 1 describes all
possible outcomes of the “experiment” of placing three balls into three
cells. 


TABLE 1
 1. { abc |  -  |  -  }   10. {  a  | bc  |  -  }   19. {  -  |  a  | bc  }
 2. {  -  | abc |  -  }   11. {  b  | ac  |  -  }   20. {  -  |  b  | ac  }
 3. {  -  |  -  | abc }   12. {  c  | ab  |  -  }   21. {  -  |  c  | ab  }
 4. { ab  |  c  |  -  }   13. {  a  |  -  | bc  }   22. {  a  |  b  |  c  }
 5. { ac  |  b  |  -  }   14. {  b  |  -  | ac  }   23. {  a  |  c  |  b  }
 6. { bc  |  a  |  -  }   15. {  c  |  -  | ab  }   24. {  b  |  a  |  c  }
 7. { ab  |  -  |  c  }   16. {  -  | ab  |  c  }   25. {  b  |  c  |  a  }
 8. { ac  |  -  |  b  }   17. {  -  | ac  |  b  }   26. {  c  |  a  |  b  }
 9. { bc  |  -  |  a  }   18. {  -  | bc  |  a  }   27. {  c  |  b  |  a  }


Each of these arrangements represents a simple event, that is, a
sample point. The event A “one cell is multiply occupied” is realized
in the arrangements numbered 1-21, and we express this by saying
that the event A is the aggregate of the sample points 1-21. Similarly,
the event B “first cell is not empty” is the aggregate of the sample
points 1, 4-15, 22-27. The event C defined by “both A and B occur”
is the aggregate of the thirteen sample points 1, 4-15. In this
particular example it so happens that each of the 27 points belongs to either
A or B (or to both); therefore the event “either A or B or both occur”
is the entire sample space and occurs with absolute certainty. The
event D defined by “A does not occur” consists of the points 22-27
and can be described by the condition that no cell remains empty.
The event “first cell empty and no cell multiply occupied” is
impossible (does not occur) since no sample point satisfies these
specifications.
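The bookkeeping of example (a) can be checked mechanically. The following Python sketch is an illustration only: the representation of a sample point as a tuple giving the cell of each ball, and all names used, are assumptions made here and are not part of the text.

```python
from itertools import product

BALLS = "abc"
CELLS = (1, 2, 3)

# A sample point records the cell chosen for each of the balls a, b, c.
sample_space = list(product(CELLS, repeat=len(BALLS)))

def occupancy(point):
    """Number of balls in each of the three cells."""
    return tuple(point.count(cell) for cell in CELLS)

# A: some cell is multiply occupied; B: the first cell is not empty.
A = {p for p in sample_space if max(occupancy(p)) >= 2}
B = {p for p in sample_space if occupancy(p)[0] >= 1}
C = A & B                      # both A and B occur
D = set(sample_space) - A      # A does not occur

print(len(sample_space), len(A), len(B), len(C), len(D))  # 27 21 19 13 6
print(A | B == set(sample_space))                         # True
```

The counts agree with the point numbers quoted above: 21 points for A, 19 for B, 13 for C, and 6 for D.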

(b) Distribution of r balls in n cells. The more general case of r balls 
in n cells can be studied in the same manner, except that the number of 
possible arrangements increases rapidly with r and n. For r = 3 balls
in n = 4 cells, the sample space contains already 64 points, and for
r = n = 10 there are 10^10 sample points; a complete tabulation would
require some hundred thousand big volumes. 
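The two counts just quoted follow from the fact that each of the r balls can go into any of the n cells independently, which gives n^r arrangements. A minimal Python sketch, with the hypothetical helper name count_arrangements chosen here for illustration:

```python
from itertools import product

def count_arrangements(r, n):
    """Number of ways to place r distinguishable balls into n cells."""
    return n ** r

# Direct enumeration agrees with the formula in the small case r = 3, n = 4.
enumerated = len(list(product(range(4), repeat=3)))

print(enumerated, count_arrangements(3, 4))  # 64 64
print(count_arrangements(10, 10))            # 10000000000
```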

We use this example to illustrate the important fact that the nature 
of the sample points is irrelevant for our theory. To us the sample 
space (together with the probability distribution defined in it) defines 
the idealized experiment. We use the picturesque language of balls 
and cells, but the same sample space admits of a great variety of
different practical interpretations. To clarify this point, and also for
further reference, we list here a number of situations in which the intuitive
background varies; all are, however, abstractly equivalent to the scheme of 
placing r balls into n cells, in the sense that the outcomes differ only in 
their verbal description. The appropriate assignment of probabilities is 
not the same in all cases and will be discussed later on. 


(b, 1). Birthdays. The possible configurations of the birthdays of 
r people correspond to the different arrangements of r balls in 
n = 365 cells (assuming the year to have 365 days). 

(b, 2). Accidents. Classifying r accidents according to the
weekdays when they occurred is equivalent to placing r balls into n = 7
cells.

(b, 3). In firing at n targets, the hits correspond to balls, the targets
to cells. 

(b, 4). Sampling. Let a group of r people be classified according 
to, say, age or profession. The classes play the role of our cells, the 
people that of balls. 

(b, 5). Irradiation in biology. When the cells in the retina of the 
eye are exposed to light, the light particles play the role of balls, 
and the actual cells are the “cells” of our model. Similarly, in the
study of the genetic effect of irradiation, the chromosomes correspond
to the cells of our model and α-particles to the balls.

(b, 6). In cosmic ray experiments the particles hitting the Geiger
counters represent the balls, and the counters function as cells. 

(b, 7). An elevator starts with r passengers and stops at n floors. 
The different arrangements of discharging the passengers are replicas 
of the different distributions of r balls in n cells. 

(b, 8). Dice. The possible outcomes of a throw with r dice
correspond to placing r balls into n = 6 cells. When tossing a coin we are
in effect dealing with only n = 2 cells.

(b, 9). Random digits. The possible orderings of a sequence of r 
digits correspond to the distribution of r balls (= places) into ten
cells called 0, 1, ..., 9. 

(b, 10). The sex distribution of r persons. Here we have n = 2 
cells and r balls. 

(b, 11). Coupon collecting. The different kinds of coupons repre- 
sent the cells; the coupons collected represent the balls. 

(b, 12). Aces in bridge. The four players represent four cells, and 
we have r = 4 balls. 

(b, 13). Gene distributions. Each descendant of an individual
(person, plant, or animal) inherits from the progenitor certain genes. If
a particular gene can appear in n forms A_1, ..., A_n, then the
descendants may be classified according to the type of the gene. The
descendants correspond to the balls, the genotypes A_1, ..., A_n to
the cells.

(b, 14). Chemistry. Suppose that a long chain polymer reacts with 
oxygen. An individual chain may react with 0, 1, 2, ... oxygen 
molecules. Here the reacting oxygen molecules play the role of balls 
and the polymer chains the role of cells into which the balls are put. 

(b, 15). Theory of photographic emulsions. A photographic plate 
is covered with grains sensitive to light quanta: a grain reacts if it 
is hit by a certain number, r, of quanta. For the theory of black- 
white contrast we must know how many cells are likely to be hit by 
the r quanta. We have here an occupancy problem where the grains 
correspond to cells, and the light quanta to balls. (Actually the 
situation is more complicated since a plate usually contains grains 
of different sensitivity.) 

(b, 16). Misprints. The possible distributions of r misprints in 
the n pages of a book correspond to all the different distributions of 
r balls in n cells, provided r is smaller than the number of letters per
page.


(c) The case of indistinguishable balls. Let us return to example (a)
and suppose that the three balls are not distinguishable. This means
that we no longer distinguish between three arrangements such as 4, 5,
6, and thus table 1 reduces to table 2. The latter defines the sample
space of the ideal experiment which we call “placing three
indistinguishable balls into three cells,” and a similar procedure applies to the
case of r balls in n cells.


TABLE 2
 1. { *** |  -  |  -  }    6. {  *  | **  |  -  }
 2. {  -  | *** |  -  }    7. {  *  |  -  | **  }
 3. {  -  |  -  | *** }    8. {  -  | **  |  *  }
 4. { **  |  *  |  -  }    9. {  -  |  *  | **  }
 5. { **  |  -  |  *  }   10. {  *  |  *  |  *  }

Whether or not actual balls are in practice distinguishable is irrelevant
for our theory. Even if they are, we may decide to treat them as
indistinguishable. The aces in bridge [example (b, 12)] or the people in an
elevator [example (b, 7)] certainly are distinguishable and yet it is often
preferable to treat them as indistinguishable. The dice of example 
(b, 8) may be colored to make them distinguishable, but whether in 
discussing a particular problem we use the model of distinguishable or 
indistinguishable balls is purely a matter of purpose and convenience. 
The nature of a concrete problem may dictate the choice, but under 
any circumstances our theory begins only after the appropriate model 
has been chosen, that is, after the sample space has been defined. 

In the scheme above we have considered indistinguishable balls, but 
table 2 still refers to a first, second, third cell, and their order is 
essential. We can go a step further and assume that even the cells 
are indistinguishable (for example, the cell may be chosen at random 
without regard to its contents). With both balls and cells
indistinguishable, only three different arrangements are possible, namely
{*** | - | - }, {** | * | - }, {* | * | *}.
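The reduction from 27 points to ten, and then to three, can be reproduced by listing occupancy numbers. The sketch below is illustrative only; the function name occupancy_vectors and the encoding of a sample point as a tuple of cell counts are assumptions made here.

```python
from itertools import combinations_with_replacement

def occupancy_vectors(r, n):
    """All ways to place r indistinguishable balls into n ordered cells."""
    vectors = set()
    for cells in combinations_with_replacement(range(n), r):
        # Only the number of balls per cell matters, not which ball is where.
        vectors.add(tuple(cells.count(c) for c in range(n)))
    return sorted(vectors, reverse=True)

points = occupancy_vectors(3, 3)
print(len(points))  # 10

# If the cells are indistinguishable too, only the pattern of counts remains.
patterns = {tuple(sorted(v, reverse=True)) for v in points}
print(sorted(patterns, reverse=True))  # [(3, 0, 0), (2, 1, 0), (1, 1, 1)]
```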

(d) Sampling. Suppose that a sample of 100 people is taken in order 
to estimate how many people smoke. The only property of the sample 
of interest in this connection is the number x of smokers; this may be 
any integer between 0 and 100. In this case we may agree that 
our sample space consists of the 101 “points” 0, 1, ..., 100. Every 
particular sample or observation is completely described by stating 
the corresponding point x. An example of a compound event is the 
result that “the majority of the people sampled are smokers.” This 
means that the experiment resulted in one of the fifty simple events 
51, 52, ..., 100, but it is not stated in which. Similarly, every property 
of the sample can be described in enumerating the corresponding cases
or sample points. For uniform terminology we speak of events rather
than properties of the sample. Mathematically, an event is simply 
the aggregate of the corresponding sample points. 

(e) Sampling (continued). Suppose now that the 100 people in our
sample are classified not only as smokers or non-smokers but also as
males or females. The sample may now be characterized by a
quadruple (M_s, F_s, M_n, F_n) of integers giving in order the number of male
and female smokers, male and female non-smokers. We can take for
sample points the quadruples of integers lying between 0 and 100 and
adding to 100. There are 176,851 such quadruples, and they constitute
the sample space (cf. chapter II, section 5). The event “relatively
more males than females smoke” means that in our sample the ratio
M_s/M_n is greater than F_s/F_n. The point (73, 2, 8, 17) has this
property, but (0, 1, 50, 49) has not. Our event can be described in principle
by enumerating all quadruples with the desired property.
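The count 176,851 and the two sample points quoted in example (e) can be verified directly. This Python sketch is an illustration, not part of the text; the closed form C(103, 3) is the standard count of such quadruples, and comparing the ratios by cross-multiplication (so that a zero denominator causes no error) is a convention assumed here.

```python
from math import comb

TOTAL = 100

# Count quadruples (Ms, Fs, Mn, Fn) of non-negative integers adding to 100.
# Once Ms, Fs, Mn are chosen with sum at most 100, Fn is determined.
count = sum(
    1
    for ms in range(TOTAL + 1)
    for fs in range(TOTAL + 1 - ms)
    for mn in range(TOTAL + 1 - ms - fs)
)
print(count, comb(TOTAL + 3, 3))  # 176851 176851

def relatively_more_males_smoke(ms, fs, mn, fn):
    """Ms/Mn > Fs/Fn, written as Ms*Fn > Fs*Mn to avoid division."""
    return ms * fn > fs * mn

print(relatively_more_males_smoke(73, 2, 8, 17))   # True
print(relatively_more_males_smoke(0, 1, 50, 49))   # False
```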

(f) Coin tossing. For the experiment of tossing a coin three times, 
the sample space consists of eight points which may conveniently be 
represented by HHH, HHT, HTH, THH, HTT, THT, TTH, TTT. 
The event A, “two or more heads,” is the aggregate of the first four 
points. The event B, “just one tail,” means either HHT, or HTH, or 
THH; we say that B contains these three points.
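For a sample space this small the events can simply be listed. A short Python sketch, in which representing each point as a string of the letters H and T is an assumption made here:

```python
from itertools import product

# The eight sample points for three tosses of a coin.
sample_space = ["".join(t) for t in product("HT", repeat=3)]

A = {p for p in sample_space if p.count("H") >= 2}   # two or more heads
B = {p for p in sample_space if p.count("T") == 1}   # just one tail

print(sorted(A))  # ['HHH', 'HHT', 'HTH', 'THH']
print(sorted(B))  # ['HHT', 'HTH', 'THH']
print(B <= A)     # True
```

The last line shows that every point of B also belongs to A.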

(g) Ages of a couple. An insurance company is interested in the age 
distribution of couples. Let x stand for the age of the husband, y for 
the age of the wife. Each observation results in a number-pair (x, y).
For the sample space corresponding to a single observation we take the
first quadrant of the x, y-plane so that each point x > 0, y > 0 is a
sample point. The event A, “husband is older than 40,” is represented
by all points to the right of the line x = 40; the event B, “husband is
older than wife,” is represented by the angular region between the
x-axis and the bisector y = x, that is to say, by the aggregate of points
with x > y; the event C, “wife is older than 40,” is represented by the
portion of the first quadrant above the line y = 40. For a geometric
representation of the joint age distributions of two couples we would 
require a four-dimensional space. 

(h) Phase space. In statistical mechanics, each possible “state” of a 
system is called a “point in phase space.” This is only a difference in 
terminology. The phase space is simply our sample space; its points 
are our sample points. 


3. THE SAMPLE SPACE. EVENTS 


It should be clear from the preceding that we shall never speak of
probabilities except in relation to a given sample space (or, physically,
in relation to a certain conceptual experiment). We start with the notion
of a sample space and its points; from now on they will be considered given.
They are the primitive and undefined notions of the theory precisely as the
notions of “points” and “straight line” remain undefined in an
axiomatic treatment of Euclidean geometry. The nature of the sample
points does not enter our theory. The sample space provides a model
of an ideal experiment in the sense that, by definition, every thinkable
outcome of the experiment is completely described by one, and only one,
sample point. It is meaningful to talk about an event A only when it 
is clear for every outcome of the experiment whether the event A has 
or has not occurred. The collection of all those sample points repre- 
senting outcomes where A has occurred completely describes the event. 
Conversely, any given aggregate A containing one or more sample 
points can be called an event; this event does, or does not, occur accord- 
ing as the outcome of the experiment is, or is not, represented by a 
point of the aggregate A. We therefore define the word event to mean 
the same as an aggregate of sample points. We shall say that an event A 
consists of (or contains) certain points, namely those representing out- 
comes of the ideal experiment in which A occurs. 


Example. In the sample space of example (2.a) consider the event
U consisting of the points number 1, 7, 13. This is a formal and
straightforward definition, but U can be described in many equivalent
ways. For example, U may be defined as the event that the following
three conditions are satisfied: (1) the second cell is empty, (2) the ball
a is in the first cell, (3) the ball b does not appear after c. Each of
these conditions itself describes an event. The event U_1 defined by
the condition (1) alone consists of points 1, 3, 7-9, 13-15. The event
U_2 defined by (2) consists of points 1, 4, 5, 7, 8, 10, 13, 22, 23, and the
event U_3 defined by (3) contains the points 1-4, 6, 7, 9-11, 13, 14, 16,
18-20, 22, 24, 25. The event U can also be described as the
simultaneous realization of all three events U_1, U_2, U_3.
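The point lists of this example can be recomputed from Table 1. In the following Python sketch the 27 arrangements are typed in by hand in the order of Table 1, each written as "cell1|cell2|cell3"; this encoding, the helper cell_of, and the reading of condition (3) as "the cell of b is not to the right of the cell of c" are assumptions made here for illustration.

```python
# Table 1, in order; an empty string stands for an empty cell.
TABLE_1 = [
    "abc||", "|abc|", "||abc",
    "ab|c|", "ac|b|", "bc|a|",
    "ab||c", "ac||b", "bc||a",
    "a|bc|", "b|ac|", "c|ab|",
    "a||bc", "b||ac", "c||ab",
    "|ab|c", "|ac|b", "|bc|a",
    "|a|bc", "|b|ac", "|c|ab",
    "a|b|c", "a|c|b", "b|a|c", "b|c|a", "c|a|b", "c|b|a",
]
points = {k + 1: row.split("|") for k, row in enumerate(TABLE_1)}

def cell_of(ball, cells):
    """Index (0, 1, 2) of the cell containing the given ball."""
    return next(i for i, content in enumerate(cells) if ball in content)

U1 = {k for k, c in points.items() if c[1] == ""}          # second cell empty
U2 = {k for k, c in points.items() if "a" in c[0]}         # a in first cell
U3 = {k for k, c in points.items()
      if cell_of("b", c) <= cell_of("c", c)}               # b not after c

print(sorted(U1))            # [1, 3, 7, 8, 9, 13, 14, 15]
print(sorted(U2))            # [1, 4, 5, 7, 8, 10, 13, 22, 23]
print(sorted(U1 & U2 & U3))  # [1, 7, 13]
```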


The terms “sample point” and “event” have an intuitive appeal, 
but they refer to the notions of point and point set common to all 
parts of mathematics. 

We have seen in the preceding example and in (2.a) that new events 
can be defined in terms of two or more given events. With these 
examples in mind we now proceed to introduce the notation of the 
formal algebra of events (that is, algebra of point sets).


4. RELATIONS AMONG EVENTS


We shall now suppose that an arbitrary, but fixed, sample space S
is given. 


Definition 1. We shall use the notation A = 0 to express that the 
event A contains no sample points (is impossible). The zero must be 
interpreted in a symbolic sense and not as the numeral. 

To every event A there corresponds another event defined by the
condition “A does not occur.” It contains all points not contained in A.


Definition 2. The event consisting of all points not contained in the
event A will be called the complementary event (or negation) of A and will 
be denoted by A’. In particular, S’ = 0. 


[Figure 1 and Figure 2; region labels: A, B, A-AB, B-AB]


Figures 1 and 2. Illustrating relations among events. In Figure 1 the domain
within heavy boundaries is the union A ∪ B ∪ C. The triangular (heavily shaded)
domain is the intersection ABC. The moon-shaped (lightly shaded) domain is the
intersection of B with the complement of A ∪ C.


With any two events A and B we can associate two new events
defined by the conditions “both A and B occur” and “either A or B or
both occur.” These events will be denoted by AB and A ∪ B,
respectively. The event AB contains all sample points which are common
to A and B. If A and B exclude each other, then there are no points
common to A and B and the event AB is impossible; analytically, this
situation is described by the equation


(4.1) AB = 0


which should be read “A and B are mutually exclusive.” The event
AB’ means that both A and B’ occur or, in other words, that A but
not B occurs. Similarly, A’B’ means that neither A nor B occurs. The
event A ∪ B means that at least one of the events A and B occurs; it
contains all sample points except those that belong neither to A nor
to B.

In the theory of probability we can describe the event AB as the
simultaneous occurrence of A and B. In standard mathematical
terminology AB is called the (logical) intersection of A and B. Similarly,
A ∪ B is the union of A and B. Our notion carries over to the case
of events A, B, C, D, ....


Definition 3. To every collection A, B, C, ... of events we define two
new events as follows. The aggregate of the sample points which belong to
all the given sets will be denoted by ABC ... and called the intersection²
(or simultaneous realization) of A, B, C, .... The aggregate of sample
points which belong to at least one of the given sets will be denoted by
A ∪ B ∪ C ... and called the union (or realization of at least one) of the
given events. The events A, B, C, ... are mutually exclusive if no two
have a point in common, that is, if AB = 0, AC = 0, ..., BC = 0, ....


We still require a symbol to express the statement that A cannot 
occur without B occurring, that is, that the occurrence of A implies the 
occurrence of B. This means that every point of A is contained in B. 
Think of intuitive analogies like the aggregate of all mothers, which 
forms a part of the aggregate of all women: All mothers are women but 
not all women are mothers. 


Definition 4. The symbols A ⊂ B and B ⊃ A are equivalent and
signify that every point of A is contained in B; they are read, respectively,
“A implies B” and “B is implied by A.” If this is the case, we shall
also write B - A instead of BA’ to denote the event that B but not A occurs.


The event B - A contains all those points which are in B but not
in A. With this notation we can write A’ = S - A and A - A = 0.


Examples. (a) If A and B are mutually exclusive, then the
occurrence of A implies the non-occurrence of B and vice versa. Thus
AB = 0 means the same as A ⊂ B’ and as B ⊂ A’.

(b) The event A - AB means the occurrence of A but not of both
A and B. Thus A - AB = AB’.
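The relations of examples (a) and (b) translate directly into set operations. The sketch below uses the three-toss sample space of example (2.f); the particular events B and N, and the helper name complement, are hypothetical choices made here for illustration.

```python
from itertools import product

S = frozenset("".join(t) for t in product("HT", repeat=3))

def complement(event):
    """The event A': all points of S not contained in A."""
    return S - event

A = frozenset(p for p in S if p.count("H") >= 2)   # two or more heads
B = frozenset(p for p in S if p[0] == "T")         # first toss is a tail
N = frozenset(p for p in S if p.count("H") == 0)   # no head at all

# Example (a): A and N are mutually exclusive, so A implies N' and N implies A'.
assert not (A & N)
assert A <= complement(N) and N <= complement(A)

# Example (b): A - AB = AB'.
assert A - (A & B) == A & complement(B)

# A' = S - A and A - A = 0.
assert complement(A) == S - A and not (A - A)

print(sorted(A & B))  # ['THH']
```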

(c) In the example (2.g), the event AB means that the husband is
older than 40 and older than his wife; AB’ means that he is older than
40 but not older than his wife. AB is represented by the infinite
trapezoidal region between the x-axis and the lines x = 40 and y = x, and
the event AB’ is represented by the angular domain between the lines
x = 40 and y = x, the latter boundary included. The event AC means
that both husband and wife are older than 40. The event A ∪ C
means that at least one of them is older than 40, and A ∪ B means
that the husband is either older than 40 or, if not that, at least older
than his wife (in official language, “husband’s age exceeds 40 years or
wife’s age, whichever is smaller”).


2 The standard mathematical notation for the intersection of two or more sets
is A ∩ B or A ∩ B ∩ C, etc. This notation is more suitable for certain specific
purposes and is to be adopted in the second volume. At present we use the
notation AB, ABC, etc., since it is less clumsy in print.

(d) In example (2.a) let E_i be the event that the cell number i is
empty (here i = 1, 2, 3). Similarly, let S_i, D_i, T_i, respectively, denote
the event that the cell number i is occupied simply, doubly, or triply.
Then E_1E_2 = T_3, and S_1S_2 ⊂ S_3, and D_1D_2 = 0. Note also that
T_1 ⊂ E_2, etc. The event D_1 ∪ D_2 ∪ D_3 is defined by the condition
that there exist at least one doubly occupied cell.
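The relations stated in example (d) can be confirmed by enumeration over the 27 points of example (2.a). In this Python sketch the dictionaries E, S, D, T (indexed by cell number) and the helper occupancy_event are names assumed here; the sample space itself is called space to avoid a clash with S_i.

```python
from itertools import product

# A sample point gives the cell (1, 2 or 3) of each of the balls a, b, c.
space = set(product((1, 2, 3), repeat=3))

def occupancy_event(i, k):
    """Event that cell number i contains exactly k balls."""
    return {p for p in space if p.count(i) == k}

# E: empty, S: simply, D: doubly, T: triply occupied (k = 0, 1, 2, 3).
E, S, D, T = ({i: occupancy_event(i, k) for i in (1, 2, 3)} for k in range(4))

assert E[1] & E[2] == T[3]      # E_1 E_2 = T_3
assert S[1] & S[2] <= S[3]      # S_1 S_2 is contained in S_3
assert not (D[1] & D[2])        # D_1 D_2 = 0
assert T[1] <= E[2]             # T_1 implies E_2

# At least one doubly occupied cell: points 4-21 of Table 1.
print(len(D[1] | D[2] | D[3]))  # 18
```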

(e) Bridge (cf. footnote 1). Let A, B, C, D be the events, respectively,
that North, South, East, West have at least one ace. It is clear that
at least one player has an ace, so that one or more of the four events
must occur. Hence A ∪ B ∪ C ∪ D = S is the whole sample space.
The event ABCD occurs if, and only if, each player has an ace. The
event “West has all four aces” means that none of the three events
A, B, C has occurred; this is the same as the simultaneous occurrence
of A’ and B’ and C’ or the event A’B’C’.
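The full sample space of bridge deals is far too large to list, but the statements of example (e) involve only the aces. The sketch below therefore works, as an assumption made here, in the coarser sample space of example (2.b, 12): four aces (balls) distributed among four players (cells), which gives 4^4 = 256 points.

```python
from itertools import product

PLAYERS = "NSEW"

# One sample point records which player holds each of the four aces.
space = set(product(PLAYERS, repeat=4))

def has_ace(player):
    """Event that the given player holds at least one ace."""
    return {p for p in space if player in p}

A, B, C, D = (has_ace(player) for player in PLAYERS)

# At least one player has an ace: the union is the whole sample space.
assert A | B | C | D == space

# "West has all four aces" is the event A'B'C'.
west_has_all = {("W", "W", "W", "W")}
assert west_has_all == (space - A) & (space - B) & (space - C)

print(len(space), len(A & B & C & D))  # 256 24
```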

(f) In the example (2.g) we have BC ⊂ A; in words “if husband is
older than wife (B) and wife is older than 40 (C), then husband is
older than 40 (A).” How can the event A - BC be described in words?


5. DISCRETE SAMPLE SPACES 


The simplest sample spaces are those containing only a finite num- 
ber, n, of points. If n is fairly small (as in the case of tossing a few 
coins), it is easy to visualize the space. The space of distributions of 
cards in bridge is more complicated. However, we may imagine each 
sample point represented on a chip and may then consider the collec- 
tion of these chips as representing the sample space. An event A (like 
“North has two aces”) is represented by a certain set of chips, the 
complement A’ by the remaining ones. It takes only one step from 
here to imagine a bowl with infinitely many chips or a sample space 
with an infinite sequence of points E_1, E_2, E_3, ....


Examples. (a) Let us toss a coin as often as necessary to turn up
one head. The points of the sample space are then E_1 = H, E_2 = TH,
E_3 = TTH, E_4 = TTTH, etc. We may or may not consider as
thinkable the possibility that H never appears. If we do, this possibility
should be represented by a point E_0.

(b) Three players a, b, c take turns at a game, such as chess,
according to the following rules. At the start a and b play while c is out.
The loser is replaced by c and at the second trial the winner plays
against c while the loser is out. The game continues in this way until
a player wins twice in succession, thus becoming the winner of the
game. For simplicity we disregard the possibility of ties at the
individual trials. The possible outcomes of our game are indicated by the
following scheme:


(*)  aa, acc, acbb, acbaa, acbacc, acbacbb, acbacbaa, ...
     bb, bcc, bcaa, bcabb, bcabcc, bcabcaa, bcabcabb, ...


In addition, it is thinkable that no player ever wins twice in succession,
which means that the play continues indefinitely according to one of
the patterns


(**) acbacbacbacb ..., bcabcabcabca ....


The sample space corresponding to our ideal “experiment” is defined
by (*) and (**) and is infinite. It is clear that the sample points can
be arranged in a simple sequence by taking first the two points (**)
and continuing with the points of (*) in the order aa, bb, acc, bcc, ....
(This example is continued in problems 5 and 6; example V(2.a);
problem XV, 5.)
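The scheme (*) of example (b) can be generated by following the rules of the game. The Python sketch below is an illustration under assumptions made here: a sample point is recorded as the string of successive winners, and the function name finished_games and its cut-off parameter max_length are hypothetical.

```python
def finished_games(max_length):
    """All records of winners ending in a double win, up to max_length trials."""
    results = []

    def play(record, playing, waiting):
        # The game stops as soon as the same player has won twice running.
        if len(record) >= 2 and record[-1] == record[-2]:
            results.append(record)
            return
        if len(record) == max_length:
            return  # still undecided: a beginning of one of the patterns (**)
        for winner in playing:
            loser = playing[0] if winner == playing[1] else playing[1]
            # The winner stays and meets the player who was waiting.
            play(record + winner, (winner, waiting), loser)

    play("", ("a", "b"), "c")
    # Order of the text: by length, then alphabetically.
    return sorted(results, key=lambda s: (len(s), s))

print(finished_games(4))  # ['aa', 'bb', 'acc', 'bcc', 'acbb', 'bcaa']
```

Exactly two finished games exist for every length from two upward, which is why the points can be arranged in a simple sequence.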


Definition. A sample space is called discrete if it contains only finitely
many points or infinitely many points which can be arranged into a simple
sequence E_1, E_2, ....


Not every sample space is discrete. It is a known theorem (due to 
G. Cantor) that the sample space consisting of all positive numbers is 
not discrete. We are here confronted with a distinction familiar in 
mechanics. There it is usual first to consider discrete mass points 
with each individual point carrying a finite mass, and then to pass to 
the notion of a continuous mass distribution, where each individual 
point has zero mass. In the first case, the mass of a system is obtained 
simply by adding the masses of the individual points; in the second 
case, masses are computed by integration over mass densities. Quite 
similarly, the probabilities of events in discrete sample spaces are ob- 
tained by mere additions, whereas in other spaces integrations are nec- 
essary. Except for the technical tools required, there is no essential 
difference between the two cases. In order to present actual probability 
considerations unhampered by technical difficulties, we shall take up 
only discrete sample spaces. It will be seen that even this special case 
leads to many interesting and important results. 


In this volume we shall consider only discrete sample spaces.


6. PROBABILITIES IN DISCRETE SAMPLE SPACES:
PREPARATIONS


The probabilities of the various events are numbers of the same 
nature as distances in geometry or masses in mechanics. The theory 
assumes that they are given but need assume nothing about their actual 
numerical values or how they are measured in practice. Some of the 
most important applications are of a qualitative nature and independ- 
ent of numerical values; the general conclusions of the theory are ap- 
plied in many ways exactly as the theorems of geometry serve as a 
basis for physical theories and engineering applications. In the rela- 
tively few instances where numerical values for probabilities are re- 
quired, the methods of procedure vary as widely as do the methods of 
determining distances. There is little in common in the practices of 
the carpenter, the practical surveyor, the pilot, and the astronomer 
when they measure distances. In our context, we may consider the 
diffusion constant, which is a notion of the theory of probability. To 
find its numerical value, physical considerations relating it to other 
theories are required; a direct measurement is impossible. By contrast, 
mortality tables are constructed from rather crude observations. In 
most actual applications the determination of probabilities, or the com- 
parison of theory and observation, requires rather sophisticated statis- 
tical methods, which in turn are based on a refined probability theory. 
In other words, the intuitive meaning of probability is clear, but only 
as the theory proceeds shall we be able to see how it is applied. All 
possible “definitions” of probability fall far short of the actual practice. 

When tossing a “good” coin we do not hesitate to associate
probability 1/2 with either head or tail. This amounts to saying that when
a coin is tossed n times all 2^n possible results have the same probability.
From a theoretical standpoint, this is a convention. Frequently, it has 
been contended that this convention is logically unavoidable and the 
only possible one. Yet there have been philosophers and statisticians 
defying the convention and starting from contradictory assumptions 
(uniformity or non-uniformity in nature). It has also been claimed 
that the probabilities 1/2 are due to experience. As a matter of fact,
whenever refined statistical methods have been used to check on actual 
coin tossing, the result has been invariably that head and tail are not 
equally likely. And yet we stick to our model of an “ideal” coin, even
though no good coins exist. We preserve the model not merely for its 
logical simplicity, but essentially for its usefulness and applicability. 
In many applications it is sufficiently accurate to describe reality. 
More important is the empirical fact that departures from our scheme
are always coupled with phenomena such as an eccentric position of
the center of gravity. In this way our idealized model can be extremely
useful even if it never applies exactly. For example, in modern
statistical quality control based on Shewhart’s methods, idealized probability
models are used to discover “assignable causes” for flagrant departures
from these models and thus to remove impending machine troubles and 
process irregularities at an early stage. 

Similar remarks apply to other cases. The number of possible
distributions of cards in bridge is almost 10^30. Usually we agree to
consider them as equally probable. For a check of this convention more
than 10^30 experiments would be required: thousands of billions of
years if every living person played one game every second, day and 
night. However, consequences of the assumption can be verified ex- 
perimentally, for example, by observing the frequency of multiple aces 
in the hands at bridge. It turns out that for crude purposes the ideal- 
ized model describes experience sufficiently well, provided the card 
shuffling is done better than is usual. It is more important that the 
idealized scheme, when it does not apply, permits the discovery of 
“assignable causes” for the discrepancies, for example, the
reconstruction of the mode of shuffling. These are examples of limited
importance, but they indicate the usefulness of assumed models. More
interesting cases will appear only as the theory proceeds.


Examples. (a) Distinguishable balls. In example (2.a) it appears 
natural to assume that all sample points are equally probable, that is, 
that each sample point has probability 1/27. We can start from this
definition and investigate its consequences. Whether or not our model 
will come reasonably close to actual experience will depend on the type 
of phenomena to which it is applied. In some applications the assump- 
tion of equal probabilities is imposed by physical considerations; in 
others it is introduced to serve as the simplest model for a general 
orientation, even though it quite obviously represents only a crude 
first approximation (e.g., consider the examples (2.b, 1), birthdays;
(2.b, 7), elevator problem; or (2.b, 11), coupon collecting).

(b) Indistinguishable balls: Bose-Einstein statistics. We now turn to
the example (2.c) of three indistinguishable balls in three cells. It is 
possible to argue that the actual physical experiment is unaffected by 
our failure to distinguish between the balls; physically there remain 27 
different possibilities, even though only ten different forms are distin- 
guishable. This consideration leads us to attribute the following prob- 
abilities to the ten points of table 2. 


Point number:   1     2     3     4    5    6    7    8    9    10
Probability:   1/27  1/27  1/27  1/9  1/9  1/9  1/9  1/9  1/9  2/9

It must be admitted that for most applications listed in example (2.b)
this argument appears sound and the assignment of probabilities
reasonable. Historically, our argument was accepted for a long time
without question and served in statistical mechanics as the basis for the
derivation of the Maxwell-Boltzmann statistics for the distribution of r
balls in n cells. The greater was the general surprise when Bose and
Einstein showed that certain particles are subject to the Bose-Einstein
statistics (for details see chapter II, section 5). In our case with
r = n = 3, the Bose-Einstein model attributes probability 1/10 to each
of the ten sample points.
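The two assignments of probabilities to the ten points of table 2 can be computed side by side. The following Python sketch is illustrative; the dictionary names maxwell_boltzmann and bose_einstein and the encoding of a point by its occupancy numbers are assumptions made here.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

r = n = 3
placements = list(product(range(n), repeat=r))   # 27 equally likely points

# Maxwell-Boltzmann: each of the ten forms inherits the total weight of the
# distinguishable arrangements that are lumped together in it.
counts = Counter(tuple(p.count(c) for c in range(n)) for p in placements)
maxwell_boltzmann = {v: Fraction(k, len(placements)) for v, k in counts.items()}

# Bose-Einstein: the ten distinguishable forms themselves are equally likely.
bose_einstein = {v: Fraction(1, len(counts)) for v in counts}

print(maxwell_boltzmann[(3, 0, 0)],
      maxwell_boltzmann[(2, 1, 0)],
      maxwell_boltzmann[(1, 1, 1)])   # 1/27 1/9 2/9
print(bose_einstein[(1, 1, 1)])       # 1/10
```

Both dictionaries have the same ten keys, that is, the same sample space, and each set of probabilities adds to unity.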

This example will show that different assignments of probabilities 
are compatible with the same sample space and will illustrate the intri- 
cate interrelation between theory and experience. In particular, it 
teaches us not to rely too much on a priori arguments and to be pre- 
pared to accept new and unforeseen schemes. 

(c) Coin tossing. A frequency interpretation of the postulate of
equal probabilities requires records of actual experiments. Now in
reality every coin is biased, and it is possible to devise physical
experiments which come much closer to the ideal model of coin tossing than
real coins ever do. To give an idea of the fluctuations to be expected,
we give the record of such a simulated experiment corresponding to
10,000 trials with a coin.³ Table 3 contains the number of occurrences
of “heads” in a series of 100 experiments each corresponding to a
sequence of 100 trials with a coin. The grand total is 4979. Looking at
these figures the reader is very probably left with a vague feeling of:
So what? The truth is that a more advanced theory is necessary to
judge to what extent such empirical data agree with our abstract model.
(Incidentally, we shall return to this material in chapter III, section 7.)


TABLE 3

Trials number | Numbers of heads              | Total

   0- 1,000   | 54 46 53 55 46 54 41 48 51 53 |  501
    - 2,000   | 48 46 40 53 49 49 48 54 53 45 |  485
    - 3,000   | 48 52 58 51 51 50 52 50 53 49 |  509
    - 4,000   | 58 60 54 55 50 48 47 57 52 55 |  536
    - 5,000   | 48 51 51 49 44 52 50 46 53 41 |  485
    - 6,000   | 49 50 45 52 52 48 47 47 47 51 |  488
    - 7,000   | 45 47 41 51 49 59 50 55 53 50 |  500
    - 8,000   | 53 52 46 52 44 51 48 51 46 54 |  497
    - 9,000   | 45 47 46 52 47 48 59 57 45 48 |  494
    -10,000   | 47 41 51 48 59 51 52 55 39 41 |  484


3 The table actually records the frequency of even digits in a section of A Million
Random Digits with 100,000 Normal Deviates, by The RAND Corporation, The Free
Press, Glencoe, Illinois, 1955.


7. THE BASIC DEFINITIONS AND RULES 


Fundamental Convention. Given a discrete sample space $\mathfrak{S}$ with
the sample points $E_1, E_2, \ldots$, we shall assume that with each point $E_j$ there
is associated a number, called the probability of $E_j$ and denoted by $P\{E_j\}$.
It is to be non-negative and such that


(7.1)    $P\{E_1\} + P\{E_2\} + \cdots = 1.$


Note that we do not exclude the possibility that a point has prob- 
ability zero. This convention may appear artificial but is necessary 
to avoid complications. In discrete sample spaces probability zero is 
in practice interpreted as an impossibility, and any sample point known 
to have probability zero can, with impunity, be eliminated from the 
sample space. However, frequently the numerical values of the prob- 
abilities are not known in advance, and involved considerations are re- 
quired to decide whether or not a certain sample point has positive 
probability. 

Definition. The probability $P\{A\}$ of any event A is the sum of the
probabilities of all sample points in it.


The fundamental equation (7.1) states that the probability of the
entire sample space $\mathfrak{S}$ is unity, or $P\{\mathfrak{S}\} = 1$. It follows that for any
event A


(7.2)    $0 \le P\{A\} \le 1.$


Consider now two arbitrary events $A_1$ and $A_2$. To compute the
probability $P\{A_1 \cup A_2\}$ that either $A_1$ or $A_2$ or both occur, we have
to add the probabilities of all sample points contained either in $A_1$ or
in $A_2$, but each point is to be counted only once. We have, therefore,


(7.3)    $P\{A_1 \cup A_2\} \le P\{A_1\} + P\{A_2\}.$


Now, if E is any point contained both in $A_1$ and in $A_2$, then $P\{E\}$
occurs twice in the right-hand member but only once in the left-hand
member. Therefore, the right side exceeds the left side by the amount
$P\{A_1A_2\}$, and we have the simple but important


Theorem. For any two events $A_1$ and $A_2$ the probability that either
$A_1$ or $A_2$ or both occur is given by


(7.4)    $P\{A_1 \cup A_2\} = P\{A_1\} + P\{A_2\} - P\{A_1A_2\}.$


If $A_1A_2 = 0$, that is, if $A_1$ and $A_2$ are mutually exclusive, then (7.4)
reduces to


(7.5)    $P\{A_1 \cup A_2\} = P\{A_1\} + P\{A_2\}.$


Example. A coin is tossed twice. For sample space we take the
four points HH, HT, TH, TT, and associate with each probability 1/4.
Let $A_1$ and $A_2$ be, respectively, the events “head at first and second
trial.” Then $A_1$ consists of HH and HT, and $A_2$ of TH and HH. Furthermore
$A = A_1 \cup A_2$ contains the three points HH, HT, and TH,
whereas $A_1A_2$ consists of the single point HH. Thus


$P\{A_1 \cup A_2\} = \tfrac{1}{2} + \tfrac{1}{2} - \tfrac{1}{4} = \tfrac{3}{4}.$
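
The example can be checked by direct enumeration. The following Python sketch is an added illustration, not part of the original text; the names `sample_space`, `weight`, and `prob` are hypothetical helpers, and exact fractions are used so that both sides of (7.4) can be compared without rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses; each point carries probability 1/4.
sample_space = ["".join(p) for p in product("HT", repeat=2)]
weight = {point: Fraction(1, 4) for point in sample_space}

def prob(event):
    """P{A}: the sum of the probabilities of the sample points in A."""
    return sum((weight[e] for e in event), Fraction(0))

A1 = {s for s in sample_space if s[0] == "H"}   # head at first trial
A2 = {s for s in sample_space if s[1] == "H"}   # head at second trial

lhs = prob(A1 | A2)                             # P{A1 U A2}
rhs = prob(A1) + prob(A2) - prob(A1 & A2)       # right side of (7.4)
print(lhs, rhs)
```

Representing events as Python sets keeps the correspondence with the text direct: union and intersection of sets play the roles of $A_1 \cup A_2$ and $A_1A_2$.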


The probability $P\{A_1 \cup A_2 \cup \cdots \cup A_n\}$ of the realization of at least
one among n events can be computed by a formula analogous to (7.4);
this will be taken up in chapter IV, section 1. Here we note only that
the inequality (7.3) obviously holds in general. Thus for arbitrary events
$A_1, A_2, \ldots$ the inequality


(7.6)    $P\{A_1 \cup A_2 \cup \cdots\} \le P\{A_1\} + P\{A_2\} + \cdots$


holds. In the special case where the events $A_1, A_2, \ldots$ are mutually
exclusive, we have


(7.7)    $P\{A_1 \cup A_2 \cup \cdots\} = P\{A_1\} + P\{A_2\} + \cdots.$


Occasionally (7.6) is referred to as Boole’s inequality.

We shall first investigate the simple special case where the sample
space has a finite number, N, of points each having probability 1/N.
In this case, the probability of any event A equals the number of points
in A divided by N. In the older literature, the points of the sample
space were called “cases,” and the points of A “favorable” cases
(favorable for A). If all points have the same probability, then the
probability of an event A is the ratio of the number of favorable cases to
the total number of cases. Unfortunately, this statement has been
much abused to provide a “definition” of probability. It is often
contended that in every finite sample space probabilities of all points are
equal. This is not so. For a single throw of an untrue coin, the sample
space still contains only the two points, head and tail, but they may
have arbitrary probabilities p and q, with p + q = 1. A newborn baby
is a boy or girl, but in applications we have to admit that the two
possibilities are not equally likely. A further counterexample is
provided by (6.b). The usefulness of sample spaces in which all sample
points have the same probability is restricted almost entirely to the
study of games of chance and to combinatorial analysis.


8. PROBLEMS FOR SOLUTION 


1. Among the digits 1, 2, 3, 4, 5 first one is chosen, and then a second selection 
is made among the remaining four digits. Assume that all twenty possible re- 
sults have the same probability. Find the probability that an odd digit will 
be selected (a) the first time, (b) the second time, (c) both times. 


2. In the sample space of example (2.a) attach equal probabilities to all 27
points. Using the notation of example (4.d), verify formula (7.4) for the two
events $A_1 = S_1$ and $A_2 = S_2$. How many points does $S_1S_2$ contain?


3. Consider the 24 possible arrangements (permutations) of the symbols
1234 and attach to each probability 1/24. Let $A_i$ be the event that the digit
i appears at its natural place (where i = 1, 2, 3, 4). Verify formula (7.4).

4. A coin is tossed until for the first time the same result appears twice in
succession. To every possible outcome requiring n tosses attribute probability
$1/2^n$. Describe the sample space. Find the probability of the following events:
(a) the experiment ends before the sixth toss, (b) an even number of tosses is
required.

5. In the sample space of example (5.b) let us attribute to each point of
(*) containing exactly k letters probability $1/2^k$. (In other words, aa and bb
carry probability 1/4, acb has probability 1/8, etc.) (a) Show that the probabilities
of the points of (*) add up to unity, whence the two points (**) receive
probability zero. (b) Show that the probability that a wins is 5/14. The probability
of b winning is the same, and c has probability 2/7 of winning. (c) The
probability that no decision is reached at or before the kth turn (game) is $1/2^{k-1}$.

6. Modify example (5.b) to take account of the possibility of ties at the 
individual games. Describe the appropriate sample space. How would you 
define probabilities? 

7. In problem 3 show that $A_1A_2A_3 \subset A_4$ and $A_1A_2A_3' \subset A_4'$.

8. Using the notations of example (4.d) show that (a) $S_1S_2D_3 = 0$; (b)
$S_1D_2 \subset E_3$; (c) $E_3 - D_2S_1 \supset S_2D_1$.

9. Two dice are thrown. Let A be the event that the sum of the faces is 
odd, B the event of at least one ace. Describe the events AB, A U B, AB’. 
Find their probabilities assuming that all 36 sample points have equal proba- 
bilities. 

10. In example (2.g), discuss the meaning of the following events: (a) ABC,
(b) A − AB, (c) AB′C.

11. In example (2.g), verify that AC′ ⊂ B.

12. Bridge (cf. footnote 1). For k = 1, 2, 3, 4 let $N_k$ be the event that North
has at least k aces. Let $S_k$, $E_k$, $W_k$ be the analogous events for South, East,
West. What can be said about the number x of aces in West’s possession in
the events (a) $W_1'$, (b) $N_2S_2$, (c) $N_1'S_1'E_1'$, (d) $W_2 - W_3$, (e) $N_1S_1E_1W_1$, (f)
$N_3W_1$, (g) $(N_2 \cup S_2)E_2$?

13. In the preceding problem verify that (a) $S_3 \subset S_2$, (b) $S_3W_2 = 0$, (c)
$N_2S_1E_1W_1 = 0$, (d) $N_2S_2 \subset W_1'$, (e) $(N_2 \cup S_2)W_3 = 0$, (f) $W_4 = N_1'S_1'E_1'$.


14. Verify the following relations.⁴

(a) (A ∪ B)′ = A′B′.
(b) (A ∪ B) − B = A − AB = AB′.
(c) AA = A ∪ A = A.
(d) (A − AB) ∪ B = A ∪ B.
(e) (A ∪ B) − AB = AB′ ∪ A′B.
(f) A′ ∪ B′ = (AB)′.
(g) (A ∪ B)C = AC ∪ BC.

15. Find simple expressions for
(a) (A ∪ B)(A ∪ B′), (b) (A ∪ B)(A′ ∪ B)(A ∪ B′), (c) (A ∪ B)(B ∪ C).
16. State which of the following relations are correct and which incorrect:
(a) (A ∪ B) − C = A ∪ (B − C).
(b) ABC = AB(C ∪ B).
(c) A ∪ B ∪ C = A ∪ (B − AB) ∪ (C − AC).
(d) A ∪ B = (A − AB) ∪ B.
(e) AB ∪ BC ∪ CA ⊃ ABC.
(f) (AB ∪ BC ∪ CA) ⊂ (A ∪ B ∪ C).
(g) (A ∪ B) − A = B.
(h) AB′C ⊂ A ∪ B.
(i) (A ∪ B ∪ C)′ = A′B′C′.
(j) (A ∪ B)′C = A′C ∪ B′C.
(k) (A ∪ B)′C = A′B′C.
(l) (A ∪ B)′C = C − C(A ∪ B).
17. Let A, B, C be three arbitrary events. Find expressions for the events
that of A, B, C:

(a) Only A occurs.
(b) Both A and B, but not C, occur.
(c) All three events occur.
(d) At least one occurs.
(e) At least two occur.
(f) One and no more occurs.
(g) Two and no more occur.
(h) None occurs.
(i) Not more than two occur.


18. The union A ∪ B of two events can be expressed as the union of two
mutually exclusive events, thus: A ∪ B = A ∪ (B − AB). Express in a
similar way the union of three events A, B, C.


19. Using the result of problem 18 prove that

$P\{A \cup B \cup C\} = P\{A\} + P\{B\} + P\{C\} - P\{AB\} - P\{AC\} - P\{BC\} + P\{ABC\}.$

[This formula is a special case of IV(1.5).]


⁴ Notice that (A ∪ B)′ denotes the complement of A ∪ B which is not the same
as A′ ∪ B′. Similarly, (AB)′ is not the same as A′B′.


CHAPTER II 


Elements 
of Combinatorial Analysis 


The purpose of this chapter is to derive a few basic formulas and to 
develop the corresponding probabilistic background. A more advanced 
reader may pass directly to chapter V where the main theoretical thread 
of chapter I is taken up again. 

In the study of simple games of chance, sampling procedures, occu- 
pancy and order problems, etc., we are usually dealing with finite sam- 
ple spaces in which the same probability is attributed to all points. 
To compute the probability of an event A we have then to divide the
number of sample points in A (“favorable cases”) by the total number
of sample points (“possible cases”). This is facilitated by a systematic
use of a few rules which we shall now proceed to review. Simplicity
and economy of thought can be achieved by adhering to a few standard 
tools, and we shall follow this procedure instead of describing the 
shortest computational method in each special case. 


1. PRELIMINARIES 


Pairs. With m elements $a_1, \ldots, a_m$ and n elements $b_1, \ldots, b_n$, it is
possible to form mn pairs $(a_j, b_k)$ containing one element from each group.


Proof. Arrange the pairs in a rectangular array in the form of a
multiplication table with m rows and n columns so that $(a_j, b_k)$ stands
at the intersection of the jth row and kth column. Then each pair
appears once and only once, and the assertion becomes obvious.


Examples. (a) Bridge cards (cf. footnote 1 to chapter I, section 1).
As sets of elements take the four suits and the thirteen face values,
respectively. Each card is defined by its suit and its face value, and
there exist 4·13 = 52 such combinations, or cards.


¹ The interested reader will find many topics of elementary combinatorial analysis
treated in the classical textbook, Choice and chance, by W. A. Whitworth, fifth
edition, London, 1901, reprinted by G. E. Stechert, New York, 1942. The
companion volume by the same author, DCC exercises, reprinted New York, 1945,
contains 700 problems with complete solutions.

(b) “Seven-way lamps.” Some floor lamps so advertised contain 3
ordinary bulbs and also an indirect lighting fixture which can be
operated on three levels but need not be used at all. Each of these four
possibilities can be combined with 0, 1, 2, or 3 bulbs. Hence there are
4·4 = 16 possible combinations of which one, namely (0, 0), means
that no bulb is on. There remain fifteen (not seven) ways of operating
the lamps.
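
The rule for pairs can be checked by listing the pairs explicitly. The Python sketch below is an added illustration rather than part of the original text; the variable names (`suits`, `face_values`, `indirect_levels`, `bulbs_on`) are hypothetical labels chosen for the two examples above.

```python
from itertools import product

# Example (a): a card is a pair (suit, face value).
suits = ["spades", "hearts", "diamonds", "clubs"]
face_values = list(range(1, 14))             # thirteen face values
cards = list(product(suits, face_values))

# Example (b): a lamp setting is a pair (indirect level, bulbs lit).
indirect_levels = [0, 1, 2, 3]               # off, or one of three levels
bulbs_on = [0, 1, 2, 3]                      # number of ordinary bulbs lit
settings = list(product(indirect_levels, bulbs_on))
lit_settings = [s for s in settings if s != (0, 0)]

print(len(cards), len(settings), len(lit_settings))
```

`itertools.product` builds exactly the rectangular array used in the proof, one row per element of the first group and one column per element of the second.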


Multiplets. Given $n_1$ elements $a_1, \ldots, a_{n_1}$ and $n_2$ elements $b_1, \ldots, b_{n_2}$,
etc., up to $n_r$ elements $x_1, \ldots, x_{n_r}$; it is possible to form $n_1 \cdot n_2 \cdots n_r$
ordered r-tuplets $(a_{j_1}, b_{j_2}, \ldots, x_{j_r})$ containing one element of each kind.


Proof. If r = 2, the assertion reduces to the first rule. If r = 3,
take the pair $(a_i, b_j)$ as element of a new kind. There are $n_1n_2$ such
pairs and $n_3$ elements $c_k$. Each triple $(a_i, b_j, c_k)$ is itself a pair consisting
of $(a_i, b_j)$ and an element $c_k$; the number of triplets is therefore $n_1n_2n_3$.
Proceeding by induction, the assertion follows for every r.


Perhaps the simplest and most useful way of describing the last
theorem is as follows. To form an r-tuplet $(a_{j_1}, b_{j_2}, \ldots, x_{j_r})$ we
have to choose one a, one b, etc. We have to perform r selections in
all and have in succession $n_1, n_2, \ldots, n_r$ possibilities to choose from.
It is asserted that this procedure can lead to $n_1 \cdot n_2 \cdots n_r$ different
results.


Examples. (c) Multiple classifications. Suppose that people are
classified according to sex, marital status, and profession. The various
categories play the role of elements. If there are 17 professions, then
we have 2·2·17 = 68 classes in all.

(d) In an agricultural experiment three different treatments are to
be tested (for example, the application of a fertilizer, a spray, and
temperature). If these treatments can be applied on $r_1$, $r_2$, and $r_3$ levels
or concentrations, respectively, then there exist a total of $r_1r_2r_3$
combinations, or ways of treatment.

(e) “Placing balls into cells” amounts to choosing one cell for each
ball. With r balls we have r independent choices, and therefore r balls
can be placed into n cells in $n^r$ different ways. It will be recalled from
example I(2.b) that a great variety of conceptual experiments are
abstractly equivalent to that of placing balls into cells. For example,
considering the faces of a die as “cells,” the last proposition implies
that the experiment of throwing a die r times has $6^r$ possible outcomes,
of which $5^r$ satisfy the condition that no ace turns up. Assuming that
all outcomes are equally probable, the event “no ace in r throws” has
therefore probability $(5/6)^r$. We might expect naively that in six throws
“an ace should turn up,” but the probability of this event is only
$1 - (5/6)^6$ or less than 2/3. [Cf. example (3.b).]
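
The count of outcomes without an ace is small enough to verify by listing all throws. The following Python sketch is an added illustration under the equal-probability assumption stated above; `no_ace_probability` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

def no_ace_probability(r):
    """Probability (5/6)^r that no ace appears in r throws of a die."""
    return Fraction(5, 6) ** r

# Brute-force check for r = 6: list all outcomes, count those without face 1.
outcomes = list(product(range(1, 7), repeat=6))
without_ace = sum(1 for o in outcomes if 1 not in o)

p_at_least_one_ace = 1 - no_ace_probability(6)
print(len(outcomes), without_ace, float(p_at_least_one_ace))
```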


2. ORDERED SAMPLES 


Consider the set or “population” of n elements $a_1, a_2, \ldots, a_n$. Any
ordered arrangement $a_{j_1}, a_{j_2}, \ldots, a_{j_r}$ of r symbols is called an ordered
sample of size r drawn from our population. For an intuitive picture
we can imagine that the elements are selected one by one. Two
procedures are then possible. First, sampling with replacement; here each
selection is made from the entire population, so that the same element
can be drawn more than once. The samples are then arrangements in
which repetitions are permitted. Second, sampling without replacement;
here an element once chosen is removed from the population, so that
the sample becomes an arrangement without repetitions. Obviously,
in this case, the sample size r cannot exceed the population size n.

In sampling with replacement each of the r elements can be chosen
in n ways: the number of possible samples is therefore $n^r$, as can be
seen from the last theorem with $n_1 = n_2 = \cdots = n$. In sampling
without replacement we have n possible choices for the first element, but
only n − 1 for the second, n − 2 for the third, etc. Using the same
rule, we see that in this case we have $n(n-1)\cdots(n-r+1)$ choices
in all. Products of this type appear so often that it is convenient to
introduce the notation²


(2.1)    $(n)_r = n(n-1)\cdots(n-r+1).$

Clearly $(n)_r = 0$ for integers r > n. We have thus the following


Theorem. For a population of n elements and a prescribed sample
size r, there exist $n^r$ different samples with replacement and $(n)_r$ samples
without replacement.
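
For small populations both counts in the theorem can be confirmed by generating the samples. The Python sketch below is an added illustration; `falling` is a hypothetical name for the product $(n)_r$ of (2.1), and the five-letter population is an arbitrary choice.

```python
from itertools import permutations, product

def falling(n, r):
    """(n)_r = n(n-1)...(n-r+1); the empty product (r = 0) equals 1."""
    result = 1
    for i in range(r):
        result *= n - i
    return result

population = "abcde"                      # n = 5
n, r = len(population), 3

# Ordered samples with repetitions allowed, and without repetitions.
with_replacement = sum(1 for _ in product(population, repeat=r))
without_replacement = sum(1 for _ in permutations(population, r))
print(with_replacement, without_replacement)
```

Because one factor of the product vanishes when r > n, `falling(n, r)` returns 0 in that case, in agreement with the remark following (2.1).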


We note the special case where r = n. In sampling without replacement
a sample of size n includes the whole population and represents
a reordering (or permutation) of its elements. Accordingly, n elements
$a_1, \ldots, a_n$ can be ordered in $(n)_n = n(n-1)\cdots 2\cdot 1$ different ways.
Instead of $(n)_n$ we write n!, which is the more usual notation. We see
that our theorem has the following


Corollary. The number of different orderings of n elements is


(2.2)    $n! = n(n-1)\cdots 2\cdot 1.$


² The notation $(n)_r$ is not standard, but it will be used consistently in this book,
even if n is not an integer.


Mr. and Mrs. Smith form a sample of size two drawn from the
human population; at the same time, they form a sample of size one
drawn from the population of all couples. This example shows that
the sample size is defined only in relation to a given population.
Tossing a coin r times is one way of obtaining a sample of size r drawn from
the population of the two letters, H and T. The same arrangement of r
letters H and T is a single sample point in the space corresponding to
the experiment of tossing a coin r times.

Drawing r elements from a population of size n is an experiment
whose possible outcomes are samples of size r. Their number is $n^r$ or
$(n)_r$, depending on whether or not replacement is used. In either case,
our conceptual experiment is described by a sample space in which
each individual point represents a sample of size r.

So far we have not spoken of probabilities associated with our
samples. Usually we shall assign equal probabilities to all of them and then
speak of random samples. The word “random” is not well defined,
but when applied to samples or selections it has a unique meaning.
Whenever we speak of random samples of fixed size r, the adjective
random is to imply that all possible samples have the same probability, namely,
$n^{-r}$ in sampling with replacement and $1/(n)_r$ in sampling without
replacement, n denoting the size of the population from which the sample
is drawn. If n is large and r relatively small, the ratio $(n)_r/n^r$ is near
unity. This leads us to expect that, for large populations and relatively
small samples, the two ways of sampling are practically equivalent [cf.
problems 11.1, 11.2, and VI, 35].

We have introduced a practical terminology but have made no state- 
ments about the applicability of our model of random sampling to 
reality. Tossing coins, throwing dice, and similar activities may be 
interpreted as experiments in practical random sampling with replace- 
ments, and our probabilities are numerically close to frequencies ob- 
served in long-run experiments, even though perfectly balanced coins 
or dice do not exist. Random sampling without replacement is typified 
by successive drawings of cards from a shuffled deck (provided shuffling 
is done much better than is usual). In sampling human populations
the statistician encounters considerable and often unpredictable diffi- 
culties, and bitter experience has shown that it is difficult to obtain 
even a crude image of randomness. 


Exercise. In sampling without replacement the probability for any fixed
element of the population to be included in a random sample of size r is


$1 - \frac{(n-1)_r}{(n)_r} = 1 - \frac{n-r}{n} = \frac{r}{n}.$

In sampling with replacement the corresponding probability is $1 - \{(n-1)/n\}^r$.
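
Both statements of the exercise can be confirmed for a small population by counting samples. The Python sketch below is an added illustration; the function name `inclusion_probabilities` and the choice n = 6, r = 3 are assumptions made only for the check.

```python
from fractions import Fraction
from itertools import permutations, product

def inclusion_probabilities(n, r, element=0):
    """Probability that `element` occurs in a random ordered sample of size r,
    first without and then with replacement, found by listing all samples."""
    population = range(n)
    without = list(permutations(population, r))
    with_repl = list(product(population, repeat=r))
    p_without = Fraction(sum(element in s for s in without), len(without))
    p_with = Fraction(sum(element in s for s in with_repl), len(with_repl))
    return p_without, p_with

p_without, p_with = inclusion_probabilities(n=6, r=3)
print(p_without, p_with)
```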
3. EXAMPLES


We consider random samples of size r with replacement taken from a
population of the n elements $a_1, \ldots, a_n$. We are interested in the
event A that in such a sample $(a_{j_1}, \ldots, a_{j_r})$ no element appears twice,
that is, that our sample could have been obtained also by sampling
without replacement. The last theorem shows that there exist $n^r$
different samples in all, of which $(n)_r$ satisfy the stipulated condition.
Assuming that all arrangements have equal probability, we conclude
that the probability of no repetition in our sample is


(3.1)    $p = \frac{(n)_r}{n^r} = \frac{n(n-1)\cdots(n-r+1)}{n^r}.$

The following concrete interpretations of this formula will reveal sur- 
prising features. 

(a) Random sampling numbers. Let the population consist of the
ten digits 0, 1, ..., 9. Every succession of five digits represents a
sample of size r = 5, and we assume that each such arrangement has
probability $10^{-5}$. By (3.1), the probability that five consecutive random
digits are all different is $p = (10)_5 10^{-5} = 0.3024$.

We expect intuitively that in large mathematical tables having many
decimal places the last five digits will have many properties of
randomness. (In ordinary logarithmic and many other tables the tabular
difference is nearly constant, and the last digit therefore varies
regularly.) As an experiment, sixteen-place tables³ were selected and
the entries were counted whose last five digits are all different. In the
first twelve batches of a hundred entries each, the number of entries
with five different digits varied as follows: 30, 27, 30, 34, 26, 32, 37,
36, 26, 31, 36, 32. Small-sample theory shows that the magnitude of
the fluctuations is well within the expected limits. The average
frequency is 0.3142, which is rather close to the theoretical probability,
0.3024 [cf. example VII(3,/)].

Consider next the number e = 2.71828.... The first 800 decimals⁴
form 160 groups of five digits each, which we arrange in sixteen batches
of ten each. In these sixteen batches the numbers of groups in which
all five digits are different are as follows:


3, 1, 3, 4, 4, 1, 4, 4, 4, 2, 3, 1, 5, 4, 6, 3.


The frequencies again oscillate around the value 0.3024, and
small-sample theory confirms that the magnitude of the fluctuations is not
larger than should be expected. The overall frequency of our event
in the 160 groups is 52/160 = 0.325, which is reasonably close to
p = 0.3024.


³ Tables of probability functions, vol. I, National Bureau of Standards, 1941.
⁴ Intermédiaire des recherches mathématiques, vol. 2, 1946, p. 112.
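
The theoretical value and the two observed frequencies can be recomputed in a few lines. The Python sketch below is an added illustration; the lists `table_counts` and `e_counts` are transcriptions of the batch counts quoted in this example (read so that their totals agree with the stated frequencies 0.3142 and 0.325), and the variable names are hypothetical.

```python
from math import perm

# (3.1) with n = 10 digits and samples of size r = 5.
p_theory = perm(10, 5) / 10**5

# Batches of 100 table entries whose last five digits are all different.
table_counts = [30, 27, 30, 34, 26, 32, 37, 36, 26, 31, 36, 32]
# Batches of 10 five-digit groups among the first 800 decimals of e.
e_counts = [3, 1, 3, 4, 4, 1, 4, 4, 4, 2, 3, 1, 5, 4, 6, 3]

freq_tables = sum(table_counts) / (100 * len(table_counts))
freq_e = sum(e_counts) / (10 * len(e_counts))
print(p_theory, round(freq_tables, 4), freq_e)
```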

(b) If n balls are randomly placed into n cells, the probability that each
cell will be occupied is $n!/n^n$. This probability is surprisingly small: for
n = 7 it is only 0.00612.... This means that if in a city seven accidents
occur each week, then (assuming that all possible distributions are equally
likely) practically all weeks will contain days with two or more accidents,
and on the average only one week out of 165 will show a uniform
distribution of one accident per day. This example shows an unexpected
characteristic of pure randomness. (All possible configurations of seven
balls in seven cells are exhibited in table 1, section 5. With probability
about 0.87 it will be observed that two or more cells remain empty.)
For n = 6 the probability $n!\,n^{-n}$ equals 0.01543.... This shows how
extremely improbable it is that in six throws with a perfect die all
faces turn up. [The probability that a particular face does not turn
up is about 1/3; cf. example (1.e).]
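
The two numerical values quoted in this example follow directly from the formula $n!/n^n$. The Python sketch below is an added illustration; `all_cells_occupied` is a hypothetical helper name.

```python
from fractions import Fraction
from math import factorial

def all_cells_occupied(n):
    """Probability n!/n^n that n balls placed at random into n cells
    leave no cell empty."""
    return Fraction(factorial(n), n**n)

p7 = all_cells_occupied(7)     # seven accidents on seven days
p6 = all_cells_occupied(6)     # six throws of a die showing all faces
print(float(p7), float(p6))
```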

(c) Elevator. An elevator starts with r = 7 passengers and stops at
n = 10 floors. What is the probability p that no two passengers leave
at the same floor? To render the question precise, we assume that all
arrangements of discharging the passengers have the same probability
(which is a crude approximation). Then


$p = 10^{-7}(10)_7 = (10\cdot9\cdot8\cdot7\cdot6\cdot5\cdot4)\,10^{-7} = 0.06048.$


When the event was once observed, the occurrence was deemed re- 
markable and odds of 1000 to 1 were offered against a repetition. 
(Cf. the answer to problem 10.43.) 

(d) Birthdays. The birthdays of r people form a sample of size r 
from the population of all days in the year. The years are not of equal 
length, and we know that the birth rates are not quite constant through- 
out the year. However, in a first approximation, we may take a random 
selection of people as equivalent to a random selection of birthdays and 
consider the year as consisting of 365 days. 

With these conventions we can interpret equation (3.1) to the effect
that the probability, p, that all r birthdays are different is⁵


(3.2)    $p = \frac{(365)_r}{365^r} = \left(1 - \frac{1}{365}\right)\left(1 - \frac{2}{365}\right)\cdots\left(1 - \frac{r-1}{365}\right).$


Again the numerical consequences are astounding. Thus for r = 23
people we have p < 1/2, that is, for 23 people the probability that at least
two people have a common birthday exceeds 1/2.


⁵ Cf. R. von Mises, Ueber Aufteilungs- und Besetzungs-Wahrscheinlichkeiten,
Revue de la Faculté des Sciences de l’Université d’Istanbul, N.S. vol. 4 (1938-1939),
pp. 145-163.

Formula (3.2) looks forbidding, but it is easy to derive good
numerical approximations to p. If r is small, we can neglect all cross products
and have in a crude approximation⁶


(3.3)    $p \approx 1 - \frac{1 + 2 + \cdots + (r-1)}{365} = 1 - \frac{r(r-1)}{730}.$

For r = 10 the correct value is p = 0.883...; equation (3.3) gives the
approximation 0.877.
For larger r we obtain a much better approximation by passing to
logarithms. For small positive x we have $\log(1 - x) \approx -x$, and thus
from (3.2)


(3.4)    $\log p \approx -\frac{1 + 2 + \cdots + (r-1)}{365} = -\frac{r(r-1)}{730}.$


For r = 30 this leads to the approximation 0.3037 whereas the correct
value is p = 0.294. For r < 40 the error in (3.4) is less than 0.08.
(For a continuation see section 7. See also answer to problem 10.44.)
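
The exact product (3.2) and the two approximations are easy to compare numerically. The Python sketch below is an added illustration under the same conventions (365 equally likely birthdays); the function names are hypothetical, and natural logarithms are assumed in (3.4).

```python
from math import exp

def p_all_different(r, days=365):
    """Exact value of (3.2): product of (1 - k/days) for k = 1, ..., r-1."""
    p = 1.0
    for k in range(1, r):
        p *= 1 - k / days
    return p

def approx_crude(r, days=365):
    """Approximation (3.3): 1 - r(r-1)/(2*days)."""
    return 1 - r * (r - 1) / (2 * days)

def approx_log(r, days=365):
    """Approximation (3.4), solved for p: exp(-r(r-1)/(2*days))."""
    return exp(-r * (r - 1) / (2 * days))

for r in (10, 23, 30):
    print(r, round(p_all_different(r), 4),
          round(approx_crude(r), 4), round(approx_log(r), 4))
```

Running the loop for other values of r shows where the crude approximation (3.3) stops being useful and the logarithmic one takes over.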


4. SUBPOPULATIONS AND PARTITIONS 


As before, we use the term population of size n to denote an aggregate 
of n elements without regard to their order. Two populations are con- 
sidered different if one contains an element not contained in the other. 
Choosing r elements out of a given population of size n means forming
a subpopulation of size r. In how many ways can this be done? Each
subpopulation of size r can be arranged in r! different orders and in this
way produces r! different samples without repetition. Conversely,
each such sample of size r contains r different elements and thus defines
a subpopulation of size r. We know that there exist $(n)_r$ samples of
the described sort. If x is the number of subpopulations of size r, then
obviously the number of ordered samples is x·r!, and we conclude that
$x = (n)_r/r!$. Numbers of this kind are known as binomial coefficients,
and the standard notation for them is


(4.1)    $\binom{n}{r} = \frac{(n)_r}{r!} = \frac{n(n-1)\cdots(n-r+1)}{1\cdot 2\cdots(r-1)\cdot r}.$


⁶ The sign ≈ signifies that the equality is only approximate.
We have now proved

Theorem 1. A population of n elements possesses $\binom{n}{r}$ different
subpopulations of size r ≤ n.

In other words, out of n elements, we can choose a group of r elements
in $\binom{n}{r}$ different ways. Now choosing the r elements to be taken
out of the given population amounts to the same as choosing the
n − r elements which are to stay in. It is therefore clear that for each
r ≤ n we must have


(4.2)    $\binom{n}{r} = \binom{n}{n-r}.$


To prove equation (4.2) directly we observe that an alternative way
of writing the binomial coefficient (4.1) is


(4.3)    $\binom{n}{r} = \frac{n!}{r!\,(n-r)!}.$


[This follows on multiplying numerator and denominator of (4.1) by
(n − r)!.] Note that the left side in equation (4.2) is not defined for
r = 0, but the right side is. In order to make equation (4.2) valid for
all integers r such that 0 ≤ r ≤ n, we now define


(4.4)    $\binom{n}{0} = 1, \quad 0! = 1,$

and $(n)_0 = 1$.


Examples. (a) Bridge and poker (cf. footnote 1 of chapter I).
Since the order of the cards in a hand is irrelevant, the last theorem
shows that there exist $\binom{52}{13} = 635{,}013{,}559{,}600$ different hands at
bridge, and $\binom{52}{5} = 2{,}598{,}960$ hands at poker. Let us calculate the
probability, p, that a hand at poker contains five different face values.
These face values can be chosen in $\binom{13}{5}$ ways, and corresponding to
each card we are free to choose one of the four suits. It follows that
$p = 4^5 \cdot \binom{13}{5} \div \binom{52}{5}$, which is approximately 0.5071. For bridge the
probability of thirteen different face values is $4^{13} \div \binom{52}{13}$ or,
approximately, 0.0001057.
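
These counts are convenient to reproduce with the standard-library function `math.comb`. The Python sketch below is an added illustration; the variable names are hypothetical.

```python
from math import comb

bridge_hands = comb(52, 13)
poker_hands = comb(52, 5)

# Five different face values: choose the values, then a suit for each card.
p_poker = 4**5 * comb(13, 5) / poker_hands
# Thirteen different face values at bridge: only the suits remain to choose.
p_bridge = 4**13 / bridge_hands

print(bridge_hands, poker_hands, round(p_poker, 4), p_bridge)
```

The symmetry relation (4.2) can be seen in the same way: `comb(52, 13)` and `comb(52, 39)` return the same integer.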
(b) Each of the 48 states has two senators. We consider the events
that in a committee of 48 senators chosen at random: (1) a given state
is represented, (2) all states are represented.

In the first case it is better to calculate the probability q of the
complementary event, namely, that the given state is not represented.
There are 96 senators, and 94 not from the given state. Hence,


$q = \binom{94}{48} \div \binom{96}{48} = \frac{48\cdot47}{96\cdot95} = 0.247368\ldots.$


Next, the theorem of section 2 shows that a committee including
one senator from each state can be chosen in $2^{48}$ different ways. The
probability that all states are included in the committee is, therefore,
$p = 2^{48} \div \binom{96}{48}$. Using Stirling’s formula (cf. section 9), it can be
shown that $p \approx \sqrt{3\pi}\,2^{-46} \approx 4\cdot10^{-14}$.
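
Both probabilities can be evaluated exactly and compared with the Stirling estimate. The Python sketch below is an added illustration; it assumes the estimate has the form $\sqrt{3\pi}\,2^{-46}$, which is what Stirling's formula gives for $2^{48} \div \binom{96}{48}$, and the variable names are hypothetical.

```python
from fractions import Fraction
from math import comb, pi, sqrt

# (1) The given state is not represented: all 48 members come from the other 94.
q = Fraction(comb(94, 48), comb(96, 48))

# (2) All states are represented: one of two senators from each of 48 states.
p = Fraction(2**48, comb(96, 48))

# Stirling-type estimate of p (assumed form: sqrt(3*pi) * 2**-46).
stirling = sqrt(3 * pi) * 2.0**-46

print(float(q), float(p), stirling)
```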

(c) An occupancy problem. Consider once more a random distribution
of r balls in n cells (i.e., each of the $n^r$ possible arrangements has
probability $n^{-r}$). To find the probability, $p_k$, that a specified cell
contains exactly k balls (k = 0, 1, ..., r) we note that the k balls can be
chosen in $\binom{r}{k}$ ways, and the remaining r − k balls can be placed into
the remaining n − 1 cells in $(n-1)^{r-k}$ ways. It follows that


(4.5)    $p_k = \binom{r}{k} \cdot \frac{1}{n^r} \cdot (n-1)^{r-k}.$


This is a special case of the so-called binomial distribution which will
be taken up in chapter VI. Numerical values will be found in table 3
of chapter IV.
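
The counting argument can be verified by listing every placement for a small case. The Python sketch below is an added illustration; `p_k` is a hypothetical function implementing the formula $\binom{r}{k}(n-1)^{r-k}/n^r$ derived above, and the case r = 4, n = 3 is chosen only to keep the enumeration short.

```python
from fractions import Fraction
from itertools import product
from math import comb

def p_k(k, r, n):
    """Probability that a specified cell holds exactly k of the r balls."""
    return Fraction(comb(r, k) * (n - 1) ** (r - k), n**r)

# Brute-force check for r = 4 balls and n = 3 cells, watching cell 0.
r, n = 4, 3
placements = list(product(range(n), repeat=r))   # the cell chosen for each ball
counts = [sum(1 for p in placements if p.count(0) == k) for k in range(r + 1)]

print(counts, [p_k(k, r, n) for k in range(r + 1)])
```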

(d) Orderings involving two kinds of elements. Consider a population
of n = a + b elements, of which a are of one kind and b of another
kind. For convenience we denote the elements by $\alpha_1, \alpha_2, \ldots, \alpha_a$,
$\beta_1, \beta_2, \ldots, \beta_b$. These elements can be ordered in n! different ways.
However, if we agree to treat both the alphas and the betas as
indistinguishable among themselves (that is, if we omit their subscripts), then
certain orderings become indistinguishable. In fact, an ordering is now
completely described by specifying the a places occupied by the alphas,
and these a places can be chosen in $\binom{a+b}{a} = \binom{a+b}{b}$ different ways.
Accordingly, a population of a indistinguishable alphas and b
indistinguishable betas can be arranged in $\binom{a+b}{a} = \binom{a+b}{b}$ distinguishable
orders. (For example, the sequence αααββ can be ordered in ten
distinguishable ways.) Any permutation among the alphas, or among
the betas, will leave the outer appearance unchanged so that a!b!
permutations have the same outer appearance. It follows that if we
attribute to each of the (a + b)! permutations the same probability 1 ÷ (a + b)!,
then all distinguishable arrangements are equally probable, each having
probability a!b! ÷ (a + b)!. Thus, if we speak about equally probable
arrangements, the term applies both to distinguishable arrangements
and to the aggregate of all permutations of the elements. (This stands
in marked contrast to the case of random placements of balls into cells;
see section 5.)
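
The parenthetical example (three alphas and two betas) can be checked by collapsing the 5! permutations into distinguishable orders. The Python sketch below is an added illustration; the letters "A" and "B" stand in for alpha and beta, and the variable names are hypothetical.

```python
from itertools import permutations
from math import comb, factorial

a, b = 3, 2
letters = "A" * a + "B" * b      # "A" for alpha, "B" for beta

# permutations() treats equal letters as distinct positions, so a set
# collapses the a!b! permutations that look alike into one order.
distinct_orders = set(permutations(letters))
per_order = factorial(a + b) // len(distinct_orders)

print(len(distinct_orders), comb(a + b, a), per_order)
```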


Theorem 2. Let $r_1, \ldots, r_k$ be integers such that

(4.6)    $r_1 + r_2 + \cdots + r_k = n, \quad r_i \ge 0.$


The number of ways in which a population of n elements can be divided
into k ordered parts (partitioned into k subpopulations) of which the first
contains $r_1$ elements, the second $r_2$ elements, etc. is


(4.7)    $\frac{n!}{r_1!\, r_2! \cdots r_k!}.$

(The numbers (4.7) are called multinomial coefficients.)

[Note that the order of the subpopulations is essential in the sense
that ($r_1 = 2$, $r_2 = 3$) and ($r_1 = 3$, $r_2 = 2$) represent different
partitions; however, no attention is paid to the order within the groups.
Note also that 0! = 1 so that the vanishing $r_i$ in no way affect
formula (4.7).]


Proof. A repeated use of (4.3) will show that the number (4.7) may
be rewritten in the form


(4.8)    $\binom{n}{r_1}\binom{n-r_1}{r_2}\binom{n-r_1-r_2}{r_3}\cdots\binom{n-r_1-\cdots-r_{k-2}}{r_{k-1}}.$


On the other hand, in order to effect the desired partition, we have
first to select $r_1$ elements out of the given n; of the remaining $n - r_1$
elements we select a second group of size $r_2$, etc. After forming the
(k−1)st group there remain $n - r_1 - r_2 - \cdots - r_{k-1} = r_k$ elements,
and these form the last group. We conclude that (4.8) indeed
represents the number of ways in which the operation can be performed.
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Examples. (6) Bridge. At a bridge table the 52 cards are parti- 
tioned into four equal groups and therefore the number of different 
situations is 52!-(13!)* = (5.36...)-1078. Let us now calculate the 
probability that each player has an ace. The four aces can be ordered 
in 4! = 24 ways, and each order represents one possibility of giving 
one ace to each player. The remaining 48 cards can be distributed in 
(48) 421) ways. Hence the required probability is 


24-48!-(13)* + 52! = 0.105... 


(f) Dice. A throw of twelve dice can result in $6^{12}$ different outcomes,
to all of which we attribute equal probabilities. The event that
each face appears twice can occur in as many ways as twelve dice can
be arranged in six groups of two each. Hence the probability of the
event is $12!/(2^6 \cdot 6^{12}) = 0.003438\ldots$.


(In theorem 2 it is permitted that $r_i = 0$ so that in reality the n
elements are divided into k or fewer subpopulations. The case $r_i > 0$
of partitions into exactly k classes is treated in problem 11.7.)
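
The figures in examples (e) and (f) can be checked numerically. The following Python sketch evaluates formula (4.7) directly; the function name `multinomial` and the variable names are our own choices for illustration and do not come from the text.

```python
from math import factorial, prod


def multinomial(*parts):
    """Formula (4.7): ordered partitions of sum(parts) elements
    into groups of the given sizes."""
    return factorial(sum(parts)) // prod(factorial(r) for r in parts)


# Example (e): four bridge hands of 13 cards each.
deals = multinomial(13, 13, 13, 13)
# 4! ways to hand out the aces, then 48 remaining cards in four groups of 12.
one_ace_each = factorial(4) * multinomial(12, 12, 12, 12)
p_aces = one_ace_each / deals

# Example (f): twelve dice split into six groups of two, out of 6**12 outcomes.
p_dice = multinomial(2, 2, 2, 2, 2, 2) / 6**12

print(f"{deals:.3e}", round(p_aces, 4), round(p_dice, 6))
```

Integer arithmetic is used for the counts, so only the final divisions involve rounding.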


*5. APPLICATION TO OCCUPANCY PROBLEMS


The examples of chapter I, section 2, indicate the wide applicability
of the model of placing randomly r balls into n cells. We now turn to
a discussion of this model, assuming, of course, that each of the $n^r$
possible distributions has probability $n^{-r}$. The most important properties
of a particular distribution are expressed by its occupancy numbers
$r_1, \ldots, r_n$ where $r_k$ is the number of balls in the kth cell. Here

(5.1)    $r_1 + r_2 + \cdots + r_n = r, \qquad r_k \ge 0.$

We agree to treat the balls as indistinguishable. The distribution of balls
is then completely described by its occupancy numbers, and two distributions
are distinguishable only if the corresponding ordered n-tuples
$(r_1, \ldots, r_n)$ are not identical. Our first aim is to prove the


Lemma. The number of distinguishable distributions [i.e. the number
of different solutions of equation (5.1)] is

(5.2)    $A_{r,n} = \binom{n+r-1}{r} = \binom{n+r-1}{n-1}.$


*The material of this section is useful and illuminating but will not be used 
explicitly in the sequel. 
7 The special case r = 100, n = 4 has been used in example I(2.e).




The number of distinguishable distributions in which no cell remains
empty is $\binom{r-1}{n-1}$.


Proof. We use the artifice of representing the n cells by the spaces
between n + 1 bars and the balls by stars. Thus |***|*| | | |****| is
used as a symbol for a distribution of r = 8 balls in n = 6 cells with
occupancy numbers 3, 1, 0, 0, 0, 4. Such a symbol necessarily starts
and ends with a bar, but the remaining n - 1 bars and r stars can
appear in an arbitrary order. In this way it becomes apparent that
the number of distinguishable distributions equals the number of ways
of selecting r places out of n + r - 1. The condition that no cell be
empty imposes the restriction that no two bars be adjacent. The r
stars leave r - 1 spaces of which n - 1 are to be occupied by bars:
thus we have $\binom{r-1}{n-1}$ choices and the lemma is proved.
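
The lemma can be confirmed by brute force for small r and n. The sketch below is only an illustration: the helper names `decode` and `count_distributions` are ours, and the enumeration is practical only for small values.

```python
from itertools import product
from math import comb


def decode(symbol):
    """Turn a bars-and-stars symbol into its occupancy numbers.
    Blanks inside empty cells are ignored."""
    inner = symbol.replace(" ", "")[1:-1]   # drop the two outer bars
    return [len(cell) for cell in inner.split("|")]


def count_distributions(r, n, no_empty=False):
    """Count the solutions (r_1, ..., r_n) of equation (5.1) by enumeration."""
    low = 1 if no_empty else 0
    return sum(1 for occ in product(range(low, r + 1), repeat=n)
               if sum(occ) == r)


print(decode("|***|*| | | |****|"))
print(count_distributions(8, 6), comb(6 + 8 - 1, 8))
print(count_distributions(8, 6, no_empty=True), comb(8 - 1, 6 - 1))
```

In each of the last two lines the two printed numbers should agree, the first being obtained by enumeration and the second from the lemma.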


Examples. (a) There are $\binom{r+5}{5}$ distinguishable results of a
throw with r indistinguishable dice.

(b) Partial derivatives. The partial derivatives of order r of an analytic
function $f(x_1, \ldots, x_n)$ of n variables do not depend on the order
of differentiation but only on the number of times that each variable
appears. Thus each variable corresponds to a cell, and hence there
exist $\binom{n+r-1}{r}$ different partial derivatives of rth order. A function
of three variables has fifteen derivatives of fourth order and 21 derivatives
of fifth order.


Placing r balls into n cells is one way of partitioning the population 
of r balls. By theorem 2 of section 4 there exist $r!/(r_1!\, r_2! \cdots r_n!)$
distributions with given occupancy numbers $r_1, \ldots, r_n$. This formula
still involves the order in which the occupancy numbers, or cells, appear, 
but frequently this order is immaterial. The following example is in- 
tended to illustrate an exceedingly simple and routine method of solv- 
ing many elementary combinatorial problems. 


Example. (c) Configurations of r = 7 balls in n = 7 cells. (The
cells may be interpreted as days of the week, the balls as calls, letters,
accidents, etc.) For the sake of definiteness let us consider the distributions
with occupancy numbers 2, 2, 1, 1, 1, 0, 0 appearing in an
arbitrary order. These seven occupancy numbers induce a partition of
the seven cells into three subpopulations (categories) consisting, respectively,
of the two doubly occupied, the three simply occupied, and the
two empty cells. Such a partition into three groups of size 2, 3, and 2
can be effected in $7!/(2!\cdot 3!\cdot 2!)$ ways. To each particular assignment
of our occupancy numbers to the seven cells there correspond
$7!/(2!\cdot 2!\cdot 1!\cdot 1!\cdot 1!\cdot 0!\cdot 0!) = 7!/(2!\cdot 2!)$ different distributions of
the r = 7 balls into the seven cells. Accordingly, the total number of
distributions such that the occupancy numbers coincide with 2, 2, 1, 1, 1,
0, 0 in some order is

(5.3)    $\dfrac{7!}{2!\,3!\,2!} \times \dfrac{7!}{2!\,2!}.$


It will be noticed that this result has been derived by a double applica- 
tion of (4.7), namely to balls and to cells. The same result can be de- 
rived and rewritten in many ways, but the present method provides 
the simplest routine technique for a great variety of problems. (Cf. 
problems 43-45 of section 10.) Table 1 contains the analogue to (5.3) 
and the probabilities for all possible configurations of occupancy num- 
bers in the case r = n = 7. 


TABLE 1

RANDOM DISTRIBUTIONS OF 7 BALLS IN 7 CELLS

                        Number of Arrangements    Probability (Number
Occupancy               Equals 7! × 7!            of Arrangements
Numbers                 Divided by                Divided by 7^7)

1, 1, 1, 1, 1, 1, 1     7! × 1!                   0.006 120
2, 1, 1, 1, 1, 1, 0     5! × 2!                    .128 518
2, 2, 1, 1, 1, 0, 0     2!3!2! × 2!2!              .321 295
2, 2, 2, 1, 0, 0, 0     3!3! × 2!2!2!              .107 098
3, 1, 1, 1, 1, 0, 0     4!2! × 3!                  .107 098
3, 2, 1, 1, 0, 0, 0     2!3! × 3!2!                .214 197
3, 2, 2, 0, 0, 0, 0     2!4! × 3!2!2!              .026 775
3, 3, 1, 0, 0, 0, 0     2!4! × 3!3!                .017 850
4, 1, 1, 1, 0, 0, 0     3!3! × 4!                  .035 699
4, 2, 1, 0, 0, 0, 0     4! × 4!2!                  .026 775
4, 3, 0, 0, 0, 0, 0     5! × 4!3!                  .001 785
5, 1, 1, 0, 0, 0, 0     2!4! × 5!                  .005 355
5, 2, 0, 0, 0, 0, 0     5! × 5!2!                  .001 071
6, 1, 0, 0, 0, 0, 0     5! × 6!                    .000 357
7, 0, 0, 0, 0, 0, 0     6! × 7!                    .000 008
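
The entries of table 1 can be regenerated mechanically by applying (4.7) twice, once to the cells and once to the balls, for every partition of 7. The following Python sketch does this; the helper names `partitions` and `arrangements` are our own, and the code is offered as a check rather than as the method by which the table was compiled.

```python
from collections import Counter
from math import factorial, prod


def partitions(total, largest=None):
    """Yield the partitions of `total` as non-increasing tuples."""
    if largest is None:
        largest = total
    if total == 0:
        yield ()
        return
    for first in range(min(total, largest), 0, -1):
        for rest in partitions(total - first, first):
            yield (first,) + rest


def arrangements(occupancy):
    """Double application of (4.7): assign the occupancy numbers to the
    cells, then distribute the balls with those occupancy numbers."""
    n, r = len(occupancy), sum(occupancy)
    cells = factorial(n) // prod(factorial(m) for m in Counter(occupancy).values())
    balls = factorial(r) // prod(factorial(x) for x in occupancy)
    return cells * balls


n = r = 7
table = {}
for part in partitions(r):
    occupancy = part + (0,) * (n - len(part))   # pad with empty cells
    table[occupancy] = arrangements(occupancy)

for occupancy, count in table.items():
    print(occupancy, count, f"{count / n**r:.6f}")
```

Since every one of the $7^7$ placements has exactly one set of occupancy numbers, the counts must add up to $7^7$, which gives a convenient consistency check.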


Note on Bose-Einstein and Fermi-Dirac statistics. Up to now we have
assumed that each of the $n^r$ possible distributions has probability $n^{-r}$. It is of
interest that facts and experience have compelled physicists to abandon this hypothesis
and to assign probabilities in different ways.

Consider a mechanical system of r indistinguishable particles. In statistical
mechanics it is usual to subdivide the phase space into a large number, n, of small
regions or cells so that each particle is assigned to one cell. In this way the state
of the entire system is described in terms of a random distribution of the r particles
in n cells. Offhand it would seem that (at least with an appropriate definition of
the n cells) all $n^r$ arrangements should have equal probabilities. If this is true, the
physicist speaks of Maxwell-Boltzmann statistics (the term ‘statistics’ is here used
in a sense peculiar to physics). Numerous attempts have been made to prove that
physical particles behave in accordance with Maxwell-Boltzmann statistics, but
modern theory has shown beyond doubt that this statistics does not apply to any
known particles; in no case are all $n^r$ arrangements approximately equally probable.
Two different probability models have been introduced, and each describes satis- 
factorily the behavior of one type of particle. The justification of either model 
depends on its success. Neither claims universality, and it is possible that some 
day a third model may be introduced for certain kinds of particles. 

Remember that we are here concerned only with indistinguishable particles. We
have r particles and n cells. By Bose-Einstein statistics we mean that only distinguishable
arrangements are considered and that each is assigned probability

(5.4)    $\binom{n+r-1}{r}^{-1}.$

It is shown in statistical mechanics that this assumption holds true for photons,
nuclei, and atoms containing an even number of elementary particles. To describe
other particles a third possible assignment of probabilities must be introduced.
Fermi-Dirac statistics is based on these hypotheses: (1) it is impossible for two or
more particles to be in the same cell, and (2) all distinguishable arrangements satisfying
the first condition have equal probabilities. The first hypothesis requires that $r \le n$.
An arrangement is then completely described by stating which of the n cells contain
a particle; and since there are r particles, the corresponding cells can be chosen
in $\binom{n}{r}$ ways. Hence, with Fermi-Dirac statistics there are in all $\binom{n}{r}$ possible
arrangements, each having probability $\binom{n}{r}^{-1}$. This model applies to electrons,
neutrons, and protons. We have here an instructive example of the impossibility of
selecting or justifying probability models by a priori arguments. In fact, no pure
reasoning could tell that photons and protons would not obey the same probability
laws. (Essential differences between Maxwell-Boltzmann and Bose-Einstein statistics
are discussed in section 11, problems 14-19.)

To sum up: the probability that cells number 1, 2, ..., n contain $r_1, r_2, \ldots, r_n$ balls,
respectively (where $r_1 + \cdots + r_n = r$) equals

(5.5)    $\dfrac{r!}{r_1!\, r_2! \cdots r_n!}\, n^{-r}$

under Maxwell-Boltzmann statistics; it is given by (5.4) under Bose-Einstein statistics;
and it equals $\binom{n}{r}^{-1}$ under Fermi-Dirac statistics provided each $r_i$ equals 0 or 1.

Note that “Maxwell-Boltzmann statistics” is the physicist’s term for what we call
random placement of balls into cells.


8 Cf. H. Margenau and G. M. Murphy, The mathematics of physics and chemistry,
New York, 1948, Chapter 12.


Examples. (a) Let n = 5, r = 3. The arrangement (*|-|*|*|-) has probability
6/125, 1/35, or 1/10, according to whether Maxwell-Boltzmann, Bose-Einstein, or Fermi-Dirac
statistics is used. See also example I(6.b).

(b) Misprints. A book contains n symbols (letters), of which r are misprinted. 
The distribution of misprints corresponds to a distribution of r balls in n cells with 
no cell containing more than one ball. It is therefore reasonable to suppose that, 
approximately, the misprints obey the Fermi-Dirac statistics. (Cf. problem 10.38.) 
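
The three probability assignments summarized in (5.4) and (5.5) are easily compared on a given set of occupancy numbers. The sketch below does so for example (a); the function names are ours, and exact fractions are used so that the results can be read off directly.

```python
from fractions import Fraction
from math import comb, factorial, prod


def maxwell_boltzmann(occ):
    """Formula (5.5): r! / (r_1! ... r_n!) * n**(-r)."""
    n, r = len(occ), sum(occ)
    return Fraction(factorial(r), prod(factorial(x) for x in occ) * n**r)


def bose_einstein(occ):
    """Formula (5.4): every distinguishable arrangement is equally likely."""
    n, r = len(occ), sum(occ)
    return Fraction(1, comb(n + r - 1, r))


def fermi_dirac(occ):
    """Equal probabilities over arrangements with at most one particle per cell."""
    n, r = len(occ), sum(occ)
    if any(x > 1 for x in occ):
        return Fraction(0)
    return Fraction(1, comb(n, r))


occ = (1, 0, 1, 1, 0)          # the arrangement of example (a): n = 5, r = 3
print(maxwell_boltzmann(occ), bose_einstein(occ), fermi_dirac(occ))
```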


5a. Application to Runs. In any ordered sequence of elements of two kinds, 
each maximal subsequence of elements of like kind is called a run. For example, 
the sequence αααβααβββα opens with an alpha run of length 3; it is followed by runs 
of length 1, 2, 3, 1, respectively. The alpha and beta runs alternate so that the 
total number of runs is always one plus the number of unlike neighbors in the given 
sequence. 

Examples of applications. The theory of runs is applied in statistics in many 
ways, but its principal uses are connected with tests of randomness or tests of 
homogeneity.

(a) In testing randomness, the problem is to decide whether a given observation 
is attributable to chance or whether a search for assignable causes is indicated. 
As a simple example suppose that an observation⁹ yielded the following arrangement
of empty and occupied seats along a lunch counter: EOEEOEEEOEEEOEOE.
Note that no two occupied seats are adjacent. Can this be due to chance? With 
five occupied and eleven empty seats it is impossible to get more than eleven runs, 
and this number was actually observed. It will be shown later that if all arrange- 
ments were equally probable the probability of eleven runs would be 0.0578. ... 
This small probability to some extent confirms the hunch that the separations ob- 
served were intentional. This suspicion cannot be proved by statistical methods, 
but further evidence could be collected from continued observation. If the lunch 
counter were frequented by families, there would be a tendency for occupants to 
cluster together, and this would lead to relatively small numbers of runs. Similarly, 
counting runs of boys and girls in a classroom might disclose the mixing to be better 
or worse than random. Improbable arrangements give clues to assignable causes; 
an excess of runs points to intentional mixing, a paucity of runs to intentional cluster- 
ing. It is true that these conclusions are never foolproof, but efficient statistical 
techniques have been developed which in actual practice minimize the risk of in- 
correct conclusions. 

The theory of runs is also useful in industrial quality control as introduced by 
Shewhart. As washers are produced, they will vary in thickness. Long runs of 
thick washers may suggest imperfections in the production process and lead to 
the removal of the causes; thus oncoming trouble may be forestalled and greater 
homogeneity of product achieved. 

In biological field experiments successions of healthy and diseased plants are
counted, and long runs are suggestive of contagion. The meteorologist watches successions
of dry and wet months¹⁰ to discover clues to a tendency of the weather to
persist.


9 F. S. Swed and C. Eisenhart, Tables for testing randomness of grouping in a
sequence of alternatives, Annals of Mathematical Statistics, vol. 14 (1943), pp. 66-87.

(b) To understand a typical problem of homogeneity, suppose that two drugs have 
been applied to two sets of patients, or that we are interested in comparing the 
efficiency of two treatments (medical, agricultural, or industrial). In practice, we 
shall have two sets of observations, say, $\alpha_1, \alpha_2, \ldots, \alpha_a$ and $\beta_1, \beta_2, \ldots, \beta_b$ corresponding
to the two treatments or representing a certain characteristic (such as weight)
of the elements of two populations. The alphas and betas are numbers which
we imagine ordered in increasing order of magnitude: $\alpha_1 \le \alpha_2 \le \cdots \le \alpha_a$ and
$\beta_1 \le \beta_2 \le \cdots \le \beta_b$. We now pool the two sets into one sequence ordered according
to magnitude. An extreme case is that all alphas precede all betas, and this may
be taken as indicative of a significant difference between the two treatments or
populations. On the other hand, if the two treatments are identical, the alphas and
betas should appear more or less in random order. Wald and Wolfowitz¹¹ have
shown that the theory of runs can be often advantageously applied to discover
small systematic differences. (An illustrative example, treated by a different
method, will be found in chapter III, section 1.)


Many problems concerning runs can be solved in an exceedingly simple manner.
Given a indistinguishable alphas and b indistinguishable betas, we know from example
(4.d) that there are $\binom{a+b}{a}$ distinguishable orderings. If there are $n_1$ alpha
runs, the number of beta runs is necessarily one of the numbers $n_1 \pm 1$ or $n_1$.
Arranging the a alphas in $n_1$ runs is equivalent to arranging them into $n_1$ cells,
none of which is empty. By the last lemma this can be done in $\binom{a-1}{n_1-1}$ distinguishable
ways. It follows, for example, that there are $\binom{a-1}{n_1-1}\binom{b-1}{n_1}$ arrangements
with $n_1$ alpha runs and $n_1 + 1$ beta runs (continued in problems 20-25
of section 11).
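
The counting formula just obtained can be tested on the lunch counter of example (a), where the occupied seats play the role of alphas (a = 5, in $n_1 = 5$ runs) and the empty seats the role of betas (b = 11, in six runs). The sketch below compares the formula with a direct enumeration of all $\binom{16}{5}$ seatings; the function names are ours and serve only as illustration.

```python
from itertools import combinations
from math import comb


def count_runs(seq):
    """One plus the number of unlike neighbors."""
    return 1 + sum(1 for x, y in zip(seq, seq[1:]) if x != y)


def with_one_extra_beta_run(a, b, n1):
    """Arrangements with n1 alpha runs and n1 + 1 beta runs."""
    return comb(a - 1, n1 - 1) * comb(b - 1, n1)


a, b = 5, 11                                  # occupied and empty seats
formula = with_one_extra_beta_run(a, b, 5)    # 5 runs of O, 6 runs of E

brute = 0
for occupied in combinations(range(a + b), a):
    seq = ["O" if i in occupied else "E" for i in range(a + b)]
    if count_runs(seq) == 11:
        brute += 1

print(count_runs("EOEEOEEEOEEEOEOE"), formula, brute, formula / comb(a + b, a))
```

The last number printed is the probability of eleven runs when all arrangements are equally probable; it comes out close to 0.0577, in line with the figure quoted in example (a) up to rounding in the last digit.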

(c) In physics, the theory of runs is used in the study of cooperative phenomena. 
In Ising’s theory of one-dimensional lattices the energy depends on the number of 
unlike neighbors, that is, the number of runs. 


6. THE HYPERGEOMETRIC DISTRIBUTION 


Many combinatorial problems can be reduced to the following form.
In a population of n elements $n_1$ are red and $n_2 = n - n_1$ are black.
A group of r elements is chosen at random. We seek the probability
$q_k$ that the group so chosen will contain exactly k red elements. Here
k can be any integer between zero and $n_1$ or r, whichever is smaller.


10 W. G. Cochran, An extension of Gold’s method of examining the apparent
persistence of one type of weather, Quarterly Journal of the Royal Meteorological
Society, vol. 64, No. 277 (1938), pp. 631-634.

11 A. Wald and J. Wolfowitz, On a test whether two samples are from the same
population, Annals of Mathematical Statistics, vol. 11 (1940), pp. 147-162.


To find $q_k$, we note that the chosen group contains k red and r - k
black elements. The red ones can be chosen in $\binom{n_1}{k}$ different ways
and the black ones in $\binom{n-n_1}{r-k}$ ways. Since any choice of k red elements
may be combined with any choice of black ones, we find

(6.1)    $q_k = \dfrac{\binom{n_1}{k}\binom{n-n_1}{r-k}}{\binom{n}{r}}.$

The system of probabilities so defined is called the hypergeometric distribution.¹²
Using formula (4.3), it is possible to rewrite (6.1) in the form

(6.2)    $q_k = \dfrac{\binom{r}{k}\binom{n-r}{n_1-k}}{\binom{n}{n_1}}.$


Note. The probabilities $q_k$ are defined only for k not exceeding r or
$n_1$. However, from the definition (4.1) it follows that $\binom{a}{b} = 0$ whenever
b > a. Therefore, formulas (6.1) and (6.2) give $q_k = 0$ if either
$k > n_1$ or $k > r$. Accordingly, the definitions (6.1) and (6.2) may be
used for all $k \ge 0$, provided the relation $q_k = 0$ is interpreted as impossibility.


Examples. (a) Quality inspection. In industrial quality control,
lots of size n are subjected to sampling inspection. The defective
items in the lot play the role of “red” elements. Their number $n_1$ is,
of course, unknown. A sample of size r is taken, and the number k
of defective items in it is determined. Formula (6.1) then permits us
to draw inferences about the likely magnitude of $n_1$; this is a typical
problem of statistical estimation and is beyond the scope of the
present book.


12 The name is explained by the fact that the generating function (cf. chapter XI)
of $\{q_k\}$ can be expressed in terms of hypergeometric functions.


(b) In example (4.b), the population consists of n = 96 senators of
whom $n_1 = 2$ represent the given state (are “red”). A group of
r = 48 senators is chosen at random. It may include k = 0, 1, or 2
senators from the given state. From (6.2) we find, remembering (4.4),

         $q_0 = q_2 = \dfrac{48 \cdot 47}{96 \cdot 95} = 0.24737\ldots, \qquad q_1 = \dfrac{48}{95} = 0.50526\ldots$

The value $q_0$ was obtained in a different way in example (4.b).
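
That (6.1) and (6.2) define the same probabilities, and that they vanish outside the admissible range of k, can be checked on the senators of example (b). The following Python sketch is a minimal illustration; the names `q_61` and `q_62` are ours, and the explicit range tests are needed only because Python's `math.comb` rejects negative arguments.

```python
from fractions import Fraction
from math import comb


def q_61(k, n, n1, r):
    """Formula (6.1); zero when k is negative or exceeds r."""
    if k < 0 or k > r:
        return Fraction(0)
    return Fraction(comb(n1, k) * comb(n - n1, r - k), comb(n, r))


def q_62(k, n, n1, r):
    """Formula (6.2), the same probability with the roles of r and n1 exchanged."""
    if k < 0 or k > n1:
        return Fraction(0)
    return Fraction(comb(r, k) * comb(n - r, n1 - k), comb(n, n1))


n, n1, r = 96, 2, 48                     # senators, "red" senators, group size
dist = [q_61(k, n, n1, r) for k in range(0, 5)]
print([str(x) for x in dist])
print([round(float(x), 5) for x in dist])
```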

(c) Estimation of the size of an animal population from recapture data.¹³
Suppose that 1000 fish caught in a lake are marked by red spots and 
released. After a while a new catch of 1000 fish is made, and it is 
found that 100 among them have red spots. What conclusions can be 
drawn concerning the number of fish in the lake? This is a typical 
problem of statistical estimation. It would lead us too far to describe 
the various methods that a modern statistician might use, but we shall 
show how the hypergeometric distribution gives us a clue to the solu- 
tion of the problem. We assume naturally that the two catches may 
be considered as random samples from the population of all fish in the 
lake. (In practice this assumption excludes situations where the two 
catches are made at one locality and within a short time.) We also 
suppose that the number of fish in the lake does not change between 
the two catches. 

13 This example was used in the first edition without knowledge that the method
is widely used in practice. Newer contributions to the literature include N. T. J.
Bailey, On estimating the size of mobile populations from recapture data, Biometrika,
vol. 38 (1951), pp. 293-306, and D. G. Chapman, Some properties of the
hypergeometric distribution with applications to zoological sample censuses, University
of California Publications in Statistics, vol. 1 (1951), pp. 131-160.


We generalize the problem by admitting arbitrary sample sizes. Let

    n        = the (unknown) number of fish in the lake.
    $n_1$    = the number of fish in the first catch. They play the role of
               red balls.
    r        = the number of fish in the second catch.
    k        = the number of red fish in the second catch.
    $q_k(n)$ = the probability that the second catch contains exactly k red
               fish.

In this formulation it is rather obvious that $q_k(n)$ is given by (6.1).
In practice $n_1$, r, and k can be observed, but n is unknown. Notice,
incidentally, that n is a fixed number which in no way depends on
chance. It is, therefore, meaningless to ask for the probability that n
is greater than, say, 6000. We know that $n_1 + r - k$ different fish
were caught, and therefore $n \ge n_1 + r - k$. This is all that can be
said with certainty. In our example we had $n_1 = r = 1000$ and
k = 100, and it is conceivable that the lake contains only 1900 fish.
However, starting from this hypothesis, we are led to the conclusion
that an event of a fantastically small probability has occurred. In fact,
assuming that there are n = 1900 fish in all, the probability that two
samples of size 1000 each will between them exhaust the entire population
is by (6.1),

         $\binom{1000}{100}\binom{900}{900}\binom{1900}{1000}^{-1} = \dfrac{(1000!)^2}{100!\,1900!}.$


Stirling’s formula (cf. section 9) shows this probability to be of the
order of magnitude $10^{-430}$, and in this situation common sense bids us
to reject our hypothesis as unreasonable. A similar reasoning would
induce us to reject the hypothesis that n is very large, say, a million.
This consideration leads us to seek the particular value of n for which
$q_k(n)$ attains its largest value, since for that n our observation would
have the greatest probability. For any particular set of observations
$n_1$, r, k, the value of n for which $q_k(n)$ is largest is denoted by $\hat{n}$ and is
called the maximum likelihood estimate of n. This notion was introduced
by R. A. Fisher. To find $\hat{n}$ consider the ratio

(6.3)    $\dfrac{q_k(n)}{q_k(n-1)} = \dfrac{(n-n_1)(n-r)}{(n-n_1-r+k)\,n}.$

A simple calculation shows that this ratio is greater than or smaller
than unity, according as $nk < n_1 r$ or $nk > n_1 r$. This means that with
increasing n the sequence $q_k(n)$ first increases and then decreases; it
reaches its maximum when n is the largest integer short of $n_1 r/k$, so
that $\hat{n}$ equals about $n_1 r/k$. In our particular example the maximum
likelihood estimate of the number of fish is $\hat{n} = 10{,}000$.
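
The behavior of the ratio can be observed numerically. The sketch below evaluates it in exact arithmetic and, for a small hypothetical catch ($n_1 = 10$, r = 8, k = 3, figures chosen by us only to keep the search cheap), locates the maximum of $q_k(n)$ by direct search. The function names are our own.

```python
from fractions import Fraction
from math import comb


def q(k, n, n1, r):
    """Formula (6.1), regarded as a function of the unknown population size n."""
    return Fraction(comb(n1, k) * comb(n - n1, r - k), comb(n, r))


def ratio(n, n1, r, k):
    """The ratio q_k(n) / q_k(n - 1), valid for n > n1 + r - k."""
    return Fraction((n - n1) * (n - r), (n - n1 - r + k) * n)


def mle_by_search(n1, r, k, n_max):
    """Scan n = n1 + r - k, ..., n_max and return the first maximizer of q_k(n)."""
    return max(range(n1 + r - k, n_max + 1), key=lambda n: q(k, n, n1, r))


# Hypothetical small example: compare the result with 10 * 8 / 3.
print(mle_by_search(10, 8, 3, 200))

# The fish example: n1 = r = 1000, k = 100.
for n in (9999, 10000, 10001):
    print(n, float(ratio(n, 1000, 1000, 100)))
```

In the fish example $n_1 r/k$ happens to be an integer, so the ratio equals exactly one at n = 10,000; the values $q_k(9999)$ and $q_k(10{,}000)$ are then equal and share the maximum.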

The true number n may be larger or smaller, and we may ask for
limits within which we may reasonably expect n to lie. For this purpose
let us test the hypothesis that n is smaller than 8500. We substitute
in (6.1) n = 8500, $n_1 = r = 1000$, and calculate the probability
that the second sample contains 100 or fewer red fish. This probability
is $x = q_0 + q_1 + \cdots + q_{100}$. A direct evaluation is cumbersome, but
using the normal approximation of chapter VII, we find easily that
x = 0.04. Similarly, if n = 12,000, the probability that the second
sample contains 100 or more red fish is about 0.03. These figures
would justify a bet that the true number n of fish lies somewhere between
8500 and 12,000. There exist other ways of formulating these
conclusions and other methods of estimation, but we do not propose to
discuss the details.
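
With a computer the direct evaluation is no longer cumbersome. The following Python sketch sums the terms of (6.1) in exact integer arithmetic; the function name `at_most` is ours. The results may be compared with the normal-approximation figures quoted above, from which they can be expected to differ somewhat in the second decimal.

```python
from fractions import Fraction
from math import comb


def at_most(k_max, n, n1, r):
    """Probability under (6.1) that the sample contains at most k_max red elements."""
    numerator = sum(comb(n1, k) * comb(n - n1, r - k) for k in range(k_max + 1))
    return Fraction(numerator, comb(n, r))


# n = 8500: probability of 100 or fewer red fish in the second catch.
low_tail = float(at_most(100, 8500, 1000, 1000))

# n = 12,000: probability of 100 or more red fish in the second catch.
high_tail = float(1 - at_most(99, 12000, 1000, 1000))

print(round(low_tail, 4), round(high_tail, 4))
```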


From the definition of the probabilities $q_k$ it follows that $q_0 + q_1 + q_2 + \cdots = 1$.
Formula (6.2) therefore implies that for any positive
integers n, $n_1$, and r

(6.4)    $\binom{r}{0}\binom{n-r}{n_1} + \binom{r}{1}\binom{n-r}{n_1-1} + \cdots + \binom{r}{n_1}\binom{n-r}{0} = \binom{n}{n_1}.$

This identity is frequently useful. We have proved it only for positive
integers n and r, but it holds true without this restriction for arbitrary
positive or negative numbers n and r (it is meaningless if $n_1$ is not a
positive integer). (An indication of two proofs is given in section 12,
problems 8 and 9.)

The hypergeometric distribution can easily be generalized to the case
where the original population of size n contains several classes of elements.
For example, let the population contain three classes of sizes
$n_1$, $n_2$, and $n - n_1 - n_2$, respectively. If a sample of size r is taken,
the probability that it contains $k_1$ elements of the first, $k_2$ elements of
the second, and $r - k_1 - k_2$ elements of the last class is, by analogy
with (6.1),

(6.5)    $\binom{n_1}{k_1}\binom{n_2}{k_2}\binom{n-n_1-n_2}{r-k_1-k_2} \Big/ \binom{n}{r}.$

It is, of course, necessary that $k_1 \le n_1$, $k_2 \le n_2$, and $r - k_1 - k_2 \le n - n_1 - n_2$.


Example. (d) Bridge. The population of 52 cards consists of four
classes, each of thirteen elements. The probability that a hand of thirteen
cards consists of five spades, four hearts, three diamonds, and one
club is

         $\binom{13}{5}\binom{13}{4}\binom{13}{3}\binom{13}{1} \Big/ \binom{52}{13}.$
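
The generalized formula is evaluated in the same way for any number of classes. The sketch below computes the bridge probability of example (d); the function name `multi_hypergeometric` is our own choice.

```python
from fractions import Fraction
from math import comb, prod


def multi_hypergeometric(class_sizes, counts):
    """Probability that a random sample of sum(counts) elements contains
    counts[i] elements of the class of size class_sizes[i], for every i."""
    n, r = sum(class_sizes), sum(counts)
    ways = prod(comb(m, k) for m, k in zip(class_sizes, counts))
    return Fraction(ways, comb(n, r))


# Example (d): five spades, four hearts, three diamonds, one club.
p_hand = multi_hypergeometric([13, 13, 13, 13], [5, 4, 3, 1])
print(p_hand, float(p_hand))
```

With only two classes the function reduces to (6.1), which offers a simple check of the implementation.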


7. EXAMPLES FOR WAITING TIMES


In this section we shall depart from the straight path of combinatorial
analysis in order to consider some sample spaces of a novel type to
which we are led by a simple variation of our occupancy problems.
Consider once more the conceptual “experiment” of placing balls randomly
into n cells. This time, however, we do not fix in advance the
number r of balls but let the balls be placed one by one as long as necessary
for a prescribed situation to arise. Two such possible situations
will be discussed explicitly: (i) The random placing of balls continues
until for the first time a ball is placed into a cell already occupied. The
process terminates when the first duplication of this type occurs.
(ii) We fix a cell (say cell number 1) and continue the procedure of placing
balls as long as this cell remains empty. The process terminates when a
ball is placed into the prescribed cell.

A few interpretations of this model will elucidate the problem. 


Examples. (a) Birthdays. In the birthday example (3.d), the
n = 365 days of the year correspond to cells, and people to balls.
Our model (i) now amounts to this: If we select people at random one 
by one, how many people shall we have to sample in order to find a 
pair with a common birthday? Model (ii) corresponds to waiting for 
my birthday to turn up in the sample. 

(b) Key problem. A man wants to open his door. He has n keys, 
of which only one fits the door. For reasons which can only be sur- 
mised, he tries the keys at random so that at each try each key has 
probability $n^{-1}$ of being tried and all possible outcomes involving the
same number of trials are equally likely. What is the probability that 
the man will succeed exactly at the rth trial? This is a special case of 
model (ii). It is interesting to compare this random search for the key 
with a more systematic approach (problem (10.11); see also problem 
V, 5). 

(c) In the preceding example we can replace the sampling of keys 
by a sampling from an arbitrary population, say by the collecting of 
coupons. Again we ask when the first duplication is to be expected 
and when a prescribed element will show up for the first time. 

(d) Coins and dice. In example I(5.a) a coin is tossed as often as 
necessary to turn up one head. This is a special case of model (ii) 
with n = 2. When a die is thrown until an ace turns up for the first 
time, the same question applies with n = 6. (Other waiting times are 
treated in problems 21, 22, and 36 of section 10, and 12 of section 11.) 


We begin with the conceptually simpler model (i). It is convenient
to use symbols of the form $(j_1, j_2, \ldots, j_r)$ to indicate that the first,
second, ..., rth ball are placed in cells number $j_1, j_2, \ldots, j_r$ and that
the process terminates at the rth step. This means that the $j_i$ are integers
between 1 and n; furthermore, $j_1, \ldots, j_{r-1}$ are all different, but
$j_r$ equals one among them. Every arrangement of this type represents
a sample point. For r only the values 2, 3, ..., n + 1 are possible, since
a doubly occupied cell cannot appear before the second ball or after the
(n+1)st ball is placed. The connection of our present problem with
the old model of placing a fixed number of balls into the n cells leads
us to attribute to each sample point $(j_1, \ldots, j_r)$ involving exactly r
balls the probability $n^{-r}$. We proceed to show that this convention is
permissible (i.e., that our probabilities add to unity) and that it leads
to reasonable results.

For a fixed r the aggregate of all sample points $(j_1, \ldots, j_r)$ represents
the event that the process terminates at the rth step. According to (3.1)
the numbers $j_1, \ldots, j_{r-1}$ can be chosen in $(n)_{r-1}$ different ways; for
$j_r$ we have the choice of the r - 1 numbers $j_1, \ldots, j_{r-1}$. It follows
that the probability of the process terminating at the rth step is

(7.1)    $q_r = \dfrac{(n)_{r-1}\,(r-1)}{n^r} = \left(1-\dfrac{1}{n}\right)\left(1-\dfrac{2}{n}\right)\cdots\left(1-\dfrac{r-2}{n}\right)\dfrac{r-1}{n},$

with $q_1 = 0$ and $q_2 = 1/n$. The probability that the process lasts for
more than r steps is $p_r = 1 - (q_1 + q_2 + \cdots + q_r)$ or $p_1 = 1$ and

(7.2)    $p_r = \dfrac{(n)_r}{n^r} = \left(1-\dfrac{1}{n}\right)\left(1-\dfrac{2}{n}\right)\cdots\left(1-\dfrac{r-1}{n}\right),$

as can be seen by simple induction. In particular, $p_{n+1} = 0$ and
$q_1 + \cdots + q_{n+1} = 1$, as is proper. Furthermore, when n = 365, formula
(7.2) reduces to (3.2), and in general our new model leads to the
same quantitative results as the previous model involving a fixed number
of balls.
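
The relations between (7.1) and (7.2) are easily verified in exact arithmetic. In the following Python sketch the names `falling`, `q` and `p` are ours; `falling(n, r)` stands for $(n)_r$.

```python
from fractions import Fraction


def falling(n, r):
    """(n)_r = n (n - 1) ... (n - r + 1), with (n)_0 = 1."""
    out = 1
    for i in range(r):
        out *= n - i
    return out


def q(r, n):
    """Formula (7.1): the first duplication occurs at the rth ball."""
    if r < 1:
        return Fraction(0)
    return Fraction(falling(n, r - 1) * (r - 1), n**r)


def p(r, n):
    """Formula (7.2): no duplication among the first r balls."""
    return Fraction(falling(n, r), n**r)


n = 6
print(sum(q(r, n) for r in range(1, n + 2)))        # total probability
print(float(p(22, 365)), float(p(23, 365)))         # birthday example
```

For the birthday example the two printed values lie on either side of 1/2, which is how the median r = 23 arises.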


The model (ii) differs from (i) in that it depends on an infinite sample
space. The sequences $(j_1, \ldots, j_r)$ are now subjected to the condition
that the numbers $j_1, \ldots, j_{r-1}$ are different from a prescribed number
$a \le n$, but $j_r = a$. Moreover, there is no a priori reason why the
process should ever terminate. For a fixed r we attribute again to each
sample point of the form $(j_1, \ldots, j_r)$ probability $n^{-r}$. For $j_1, \ldots, j_{r-1}$
we have n - 1 choices each, and for $j_r$ no choice at all. For the probability
that the process terminates at the rth step we get therefore

(7.3)    $q_r^* = \left(\dfrac{n-1}{n}\right)^{r-1} \dfrac{1}{n}, \qquad r = 1, 2, \ldots.$

Summing this geometric series we find $q_1^* + q_2^* + \cdots = 1$. Thus the
probabilities add to unity, and there is no necessity of introducing a
sample point to represent the possibility that no ball will ever be placed
into the prescribed cell number a. For the probability

         $p_r^* = 1 - (q_1^* + \cdots + q_r^*)$

that the process lasts for more than r steps we get

(7.4)    $p_r^* = \left(1 - \dfrac{1}{n}\right)^r, \qquad r = 1, 2, \ldots,$

as was to be expected.


The medians for the distributions $\{p_r\}$ and $\{p_r^*\}$ are defined as those
values of r for which $p_r$ and $p_r^*$ come closest to 1/2; it is about as likely
that the process continues beyond the median as that it stops before.
(In the birthday example (3.d) the median is r = 23.) To calculate the
median for $\{p_r\}$ we pass to logarithms as we did in (3.4). When r is
small as compared to n, we see that $-\log p_r$ is close to $r^2/2n$. It follows
that the median to $\{p_r\}$ is close to $(n \cdot 2 \cdot \log 2)^{1/2}$ or, approximately,
$\tfrac{6}{5} n^{1/2}$. It is interesting that the median increases with the square root
of the population size. By contrast, the median for $\{p_r^*\}$ is close to
$n \cdot \log 2$ or 0.7n and increases linearly with n. The probability of the
waiting time in model (ii) to exceed n is $(1 - n^{-1})^n$ or, approximately,
$e^{-1} = 0.36788\ldots$.
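
The quality of these approximations can be judged by computing the medians directly from (7.2) and (7.4). The sketch below does this for n = 365 in floating-point arithmetic; the function names and the search limit `r_max` are our own devices.

```python
from math import exp, log, sqrt


def p_model_i(r, n):
    """Formula (7.2): no cell doubly occupied after r balls."""
    out = 1.0
    for i in range(1, r):
        out *= 1 - i / n
    return out


def p_model_ii(r, n):
    """Formula (7.4): the prescribed cell still empty after r balls."""
    return (1 - 1 / n) ** r


def median(p, n, r_max):
    """Value of r in 1, ..., r_max for which p(r, n) comes closest to 1/2."""
    return min(range(1, r_max + 1), key=lambda r: abs(p(r, n) - 0.5))


n = 365
print(median(p_model_i, n, n + 1), sqrt(2 * n * log(2)), 1.2 * sqrt(n))
print(median(p_model_ii, n, 10 * n), n * log(2))
print(p_model_ii(n, n), exp(-1))
```

Each line prints the directly computed quantity first, followed by the approximations given in the text.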


8. BINOMIAL COEFFICIENTS


We have used binomial coefficients $\binom{n}{r}$ only when n is a positive
integer, but it is very convenient to extend their definition. The number
$(x)_r$ introduced in equation (2.1), namely

(8.1)    $(x)_r = x(x-1)\cdots(x-r+1),$

is well defined for all real x provided only that r is a positive integer.
For r = 0 we put $(x)_0 = 1$. Then

(8.2)    $\binom{x}{r} = \dfrac{(x)_r}{r!} = \dfrac{x(x-1)\cdots(x-r+1)}{r!}$

defines the binomial coefficients for all values of x and all positive integers r.
For r = 0 we put, as in (4.4), $\binom{x}{0} = 1$ and 0! = 1. For negative integers
r we define

(8.3)    $\binom{x}{r} = 0 \qquad (r < 0).$

We shall never use the symbol $\binom{x}{r}$ if r is not an integer.

It is easily verified that with this definition we have, for example,

(8.4)    $\binom{-1}{r} = (-1)^r, \qquad \binom{-2}{r} = (-1)^r (r+1).$


Three important properties will be used in the sequel. First, for any
positive integer n

(8.5)    $\binom{n}{r} = 0$    if either r > n or r < 0.

Second, for any number x and any integer r

(8.6)    $\binom{x}{r-1} + \binom{x}{r} = \binom{x+1}{r}.$


These relations are easily verified from the definition. The proof of 
the next relation can be found in calculus textbooks: for any number a 
and all values -1 < t < 1, we have Newton's binomial formula


(8.7)    $(1+t)^a = 1 + \binom{a}{1} t + \binom{a}{2} t^2 + \binom{a}{3} t^3 + \cdots$


If a is a positive integer, all terms to the right containing powers higher
than $t^a$ vanish automatically and the formula is correct for all t. If a
is not a positive integer, the right side represents an infinite series.
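
The extended definition lends itself to a direct computation. The sketch below is an illustration only and is not taken from the text; the helper names `binom` and `newton_series` are assumptions. It uses exact rational arithmetic for the coefficients and a truncated series for Newton's formula.

```python
from fractions import Fraction
from math import factorial, sqrt

def binom(x, r):
    """Generalized binomial coefficient: (x)_r / r! for integer r >= 0, and 0 for r < 0."""
    if r < 0:
        return Fraction(0)
    falling = Fraction(1)
    for i in range(r):
        falling *= Fraction(x) - i  # builds (x)_r = x(x-1)...(x-r+1)
    return falling / factorial(r)

def newton_series(a, t, terms=60):
    """Partial sum of 1 + C(a,1) t + C(a,2) t^2 + ... ; meaningful for -1 < t < 1."""
    return sum(float(binom(a, k)) * t ** k for k in range(terms))

print([int(binom(-1, r)) for r in range(5)])   # alternating signs, as in (8.4)
print([int(binom(-2, r)) for r in range(5)])   # (-1)^r (r + 1), as in (8.4)
print(binom(5, 7))                             # vanishes because r > n, as in (8.5)
print(newton_series(Fraction(1, 2), 0.3), sqrt(1.3))
```

The number of terms kept in `newton_series` is an arbitrary choice; convergence is slow when |t| is close to 1.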

Using equation (8.4), we see that for a = -1 the expansion (8.7)
reduces to the geometric series


(8.8)    $\frac{1}{1+t} = 1 - t + t^2 - t^3 + t^4 - + \cdots$,    -1 < t < 1.


Integrating (8.8), we obtain another formula which will be useful in
the sequel, namely, the Taylor expansion of the natural logarithm


(8.9)    $\log (1+t) = t - \tfrac{1}{2} t^2 + \tfrac{1}{3} t^3 - \tfrac{1}{4} t^4 + \cdots$,    -1 < t < 1.


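Two alternative forms for (8.9) are frequently used. Replacing t by
-t we get


(8.10)    $\log \frac{1}{1-t} = t + \tfrac{1}{2} t^2 + \tfrac{1}{3} t^3 + \tfrac{1}{4} t^4 + \cdots$,    -1 < t < 1.


Adding the last two formulas we find


(8.11)    $\tfrac{1}{2} \log \frac{1+t}{1-t} = t + \tfrac{1}{3} t^3 + \tfrac{1}{5} t^5 + \cdots$,    -1 < t < 1.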
For 0 < t < 1, the right-hand member of (8.10) exceeds t but is smaller
than $t + t^2 + t^3 + \cdots = t/(1-t)$. Hence we have the double inequality


(8.12)    $e^{t} < \frac{1}{1-t} < e^{t/(1-t)}$,    0 < t < 1.


Many useful relations and identities will be derived from (8.7) in section
12. Here we mention only that for any positive integer n we find,
letting t = 1,


(8.13)    $\binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n} = 2^n$.


Incidentally, this formula admits of a simple combinatorial interpretation:
The left side represents the number of ways in which a population
of n elements can be divided into two subpopulations if the size
of the first group is permitted to be any number k = 0, 1, ..., n. On
the other hand, such a division can be effected directly by deciding for
each element whether it is to belong to the first or second group. (A
similar argument shows that the multinomial coefficients (4.7) add up
to $k^n$.)

9. STIRLING’S FORMULA 


An important tool of analytical probability theory is contained in a 
classical theorem known as 


Stirling's Formula:
(9.1)    $n! \sim (2\pi)^{1/2} n^{n+1/2} e^{-n}$


where the sign ~ is used to indicate that the ratio of the two sides tends to
unity as n → ∞.

This formula is invaluable for many theoretical purposes and can be 
used also to obtain excellent numerical approximations. It is true that 
the difference of the two sides in (9.1) increases over all bounds, but it 
is the percentage error which really matters. It decreases steadily, and 
Stirling’s approximation is remarkably accurate even for small n. In 
fact, the right side of (9.1) approximates 1! by 0.9221 and 2! by 1.919 
and 5! = 120 by 118.019. The percentage errors are 8 and 4 and 2, 
respectively. For 10! = 3,628,800 the approximation is 3,598,600 with 
an error of 0.8 per cent. For 100! the error is only 0.08 per cent. 
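
The figures just quoted are easy to reproduce. The following short Python sketch is an illustration added here, not part of the original text; `stirling` and `percent_error` are hypothetical helper names. It evaluates the right side of (9.1) in floating point and the percentage error against the exact factorial.

```python
import math

def stirling(n):
    """Right side of (9.1): (2 pi)^(1/2) n^(n + 1/2) e^(-n)."""
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

def percent_error(n):
    """Percentage by which Stirling's approximation falls short of n!."""
    exact = math.factorial(n)
    return 100 * (exact - stirling(n)) / exact

for n in (1, 2, 5, 10, 100):
    print(n, stirling(n), percent_error(n))
```

Floating point limits this direct evaluation to moderate n; for larger n one would work with logarithms instead.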

Proof of Stirling's formula. We consider


(9.2)    $a_n = \log 2 + \log 3 + \cdots + \log (n-1) + \tfrac{1}{2} \log n$


which differs from log n! only by the factor 1/2 in the last term to the
right. We shall show that $a_n$ represents the areas of two different
polygons, and this remark will lead to two bounds for log n!. Figure
1 illustrates the situation for the special case n = 4. On writing


(9.3)    $a_n = \tfrac{1}{2}\{\log 1 + \log 2\} + \tfrac{1}{2}\{\log 2 + \log 3\} + \cdots + \tfrac{1}{2}\{\log (n-1) + \log n\}$


it becomes apparent that $a_n$ equals the area of the trapezoid whose
vertices are the points $A_1, A_2, \ldots, A_n$ of the curve y = log x with
abscissas 1, 2, ..., n and the point (n, 0) of the x-axis. This trapezoid
being inside the curve, its area is smaller than the area of the
domain bounded by the curve, the x-axis, and the line x = n.


Figure 1. Illustrating the derivation of Stirling's formula and, more generally,
the approximation of sums by integrals.


14 James Stirling, Methodus differentialis, 1730.

On the other hand, log k equals the area of the trapezoid with basis
k - 1/2 < x < k + 1/2 and bounded above by the tangent to the curve
at the point $A_k = (k, \log k)$. It follows that log (n - 1)! is greater than
the area of the domain bounded by y = log x, the x-axis, and the vertical
lines x = 3/2 and x = n - 1/2. Now (1/2) log n quite obviously exceeds
the area of the strip n - 1/2 < x < n under the curve, and hence
$a_n$ exceeds the area under the curve and between x = 3/2 and x = n. In
other words, we have shown that


(9.4)    $\int_{3/2}^{n} \log x \, dx < a_n < \int_{1}^{n} \log x \, dx$.


The indefinite integral of log x is given by x log x - x, and equation
(9.4) reduces to the double inequality


(9.5)    $(n + \tfrac{1}{2}) \log n - n + \tfrac{3}{2}(1 - \log \tfrac{3}{2}) < \log n! < (n + \tfrac{1}{2}) \log n - n + 1$.

Put for abbreviation


(9.6)    $\delta_n = \log n! - (n + \tfrac{1}{2}) \log n + n$.


Then $1 - \delta_n$ is the difference between the extreme right member of
(9.5) and log n!, that is, $1 - \delta_n$ equals the area of the domain between
the curve y = log x and the polygon $A_1 A_2 \ldots A_n$. It follows that $\delta_n$
decreases monotonically. But by (9.5) we have $\tfrac{3}{2}(1 - \log \tfrac{3}{2}) < \delta_n < 1$.
We conclude that $\delta_n$ tends to a limit comprised between 1 and
$\tfrac{3}{2}(1 - \log \tfrac{3}{2})$. Denoting this limit by log c we have


(9.7)    $\delta_n \to \log c$    where 2.45 < c < 2.72.


In logarithmic notation Stirling's formula reduces to (9.7) with $c = (2\pi)^{1/2}$
(or 2.507, approximately). Now π can be defined in many ways, and
for our purposes it is simplest and most natural to define $\pi = c^2/2$.
With this definition we have Stirling's formula, but it remains to show
that the constant so defined agrees with the more familiar π of other
formulas. This fact will develop as a by-product of other calculations
in chapter VII, and so the proof of Stirling’s formula will be completed 
there. 
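
The behavior of the sequence $\delta_n = \log n! - (n + \tfrac{1}{2}) \log n + n$ used in the proof can be observed numerically. The sketch below is an added illustration under the assumption that `math.lgamma(n + 1)` is an adequate floating-point stand-in for log n!; the name `delta` is chosen here for convenience.

```python
import math

def delta(n):
    """delta_n = log n! - (n + 1/2) log n + n, computed with lgamma(n + 1) = log n!."""
    return math.lgamma(n + 1) - (n + 0.5) * math.log(n) + n

values = [delta(n) for n in range(1, 1001)]
lower = 1.5 * (1 - math.log(1.5))        # lower bound appearing in the proof
limit = 0.5 * math.log(2 * math.pi)      # log c with c = (2 pi)^(1/2)

print(values[0], values[9], values[99], values[-1])
print(lower, limit)
```

The printed values decrease from 1 toward the limit, which lies between the two bounds of the proof.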


Refinements. Stirling's formula can be improved by the addition of further
terms. Although we shall never make use of such refinements, we shall here indicate
the proof of the following double inequality¹⁵


(9.8)    $(2\pi)^{1/2} n^{n+1/2} e^{-n} \cdot e^{1/(12n+1)} < n! < (2\pi)^{1/2} n^{n+1/2} e^{-n} \cdot e^{1/(12n)}$.


To prove (9.8) note that


(9.9)    $\delta_n - \delta_{n+1} = (n + \tfrac{1}{2}) \log \frac{n+1}{n} - 1 = \frac{1}{3(2n+1)^2} + \frac{1}{5(2n+1)^4} + \frac{1}{7(2n+1)^6} + \cdots$


15 H. Robbins, A remark on Stirling's formula, American Mathematical Monthly,
vol. 62 (1955), pp. 26-29.
[the last expansion follows from (8.11) on setting t = 1/(2n + 1)]. We increase the
extreme right member in (9.9) by replacing the coefficients 1/5, 1/7, 1/9, ... by 1/3; this
leads to a geometric series with ratio $(2n + 1)^{-2}$, and thus


(9.10)    $\delta_n - \delta_{n+1} < \frac{1}{3[(2n+1)^2 - 1]} = \frac{1}{12n} - \frac{1}{12(n+1)}$.

Accordingly, $\delta_n - 1/12n$ increases monotonically. Now the limit of this sequence
is given by Stirling's formula, and passing to antilogarithms we have the second
inequality in (9.8). The first inequality follows similarly from (9.9) on noticing
that


(9.11)    $\delta_n - \delta_{n+1} > \frac{1}{3(2n+1)^2} > \frac{1}{12n+1} - \frac{1}{12(n+1)+1}$.

The accuracy of the approximations (9.8) is remarkable; even for n = 1 the 
formula leads to the two bounds 0.9958... and 1.0023.... The upper bound pro- 
vided in (9.8) is slightly better [cf. (12.28)]. For n = 2 it yields 2.0007, for n = 5 
we get 120.01..., and for n = 10 the first five significant figures are correct. 
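
The two bounds of the double inequality, namely Stirling's approximation multiplied by $e^{1/(12n+1)}$ and by $e^{1/(12n)}$, can be tabulated with a few lines of Python. This is an added sketch, not part of the original text; the helper names are hypothetical and the computation is in floating point.

```python
import math

def stirling(n):
    """(2 pi)^(1/2) n^(n + 1/2) e^(-n)."""
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

def refined_bounds(n):
    """Lower and upper bounds for n!: Stirling's value times e^(1/(12n+1)) and e^(1/(12n))."""
    s = stirling(n)
    return s * math.exp(1 / (12 * n + 1)), s * math.exp(1 / (12 * n))

for n in (1, 2, 5, 10):
    low, high = refined_bounds(n)
    print(n, low, math.factorial(n), high)
```

For each n the exact factorial is printed between the two bounds so that the tightness of the enclosure can be read off directly.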


PROBLEMS FOR SOLUTION 


Note: Sections 11 and 12 contain problems of a different character and diverse 
complements to the text. 


10. EXERCISES AND EXAMPLES 
Note: Assume in each case that all arrangements have the same probability. 


1. How many different sets of initials can be formed if every person has one 
surname and (a) exactly two given names, (b) at most two given names, (c)
at most three given names? 


2. In how many ways can two rooks of different colors be put on a chess- 
board so that they can take each other? 


3. Letters in the Morse code are formed by a succession of dashes and dots 
with repetitions permitted. How many letters is it possible to form with ten 
symbols or less? 


4. Each domino piece is marked by two numbers. The pieces are symmetrical
so that the number-pair is not ordered. How many different pieces can be
made using the numbers 1, 2, ..., n? 


5. The numbers 1, 2, ..., n are arranged in random order. Find the probability
that the digits (a) 1 and 2, (b) 1, 2, and 3, appear as neighbors in the
order named.

6. (a) Find the probability that among three random digits there occur 2, 1, 
or 0 repetitions. (b) Do the same for four random digits.


7. Find the probabilities $p_r$ that in a sample of r random digits no two are
equal. Estimate the numerical value of $p_{10}$, using Stirling's formula.


8. What is the probability that among k random digits (a) 0 does not appear; 
(b) 1 does not appear; (c) neither 0 nor 1 appears; (d) at least one of the two 
digits 0 and 1 does not appear? Let A and B represent the events in (a) and
(b). Express the other events in terms of A and B.

9. If n balls are placed at random into n cells, find the probability that
exactly one cell remains empty. 


10. At a parking lot there are twelve places arranged in a row. A man ob- 
served that there were eight cars parked, and that the four empty places were 
adjacent to each other (formed one run). Given that there are four empty 
places, is this arrangement surprising (indicative of non-randomness)? 


11. A man is given n keys of which only one fits his door. He tries them 
successively (sampling without replacement). This procedure may require 1,
2, ..., n trials. Show that each of these n outcomes has probability $n^{-1}$.


12. Suppose that each of n sticks is broken into one long and one short part. 
The 2n parts are arranged into n pairs from which new sticks are formed. 
Find the probability (a) that the parts will be joined in the original order, (b)
that all long parts are paired with short parts.¹⁶

13. Testing a statistical hypothesis. A Cornell professor got a ticket twelve
times for illegal overnight parking. All twelve tickets were given either 
Tuesdays or Thursdays. Find the probability of this event. (Was his renting 
a garage only for Tuesdays and Thursdays justified?) 


14. Continuation. Of twelve police tickets none was given on Sunday. Is 
this evidence that no tickets are given on Sundays? 


15. A box contains ninety good and ten defective screws. If ten screws are 
used, what is the probability that none is defective? 


16. From the population of five symbols a, b, c, d, e, a sample of size 25 is
taken. Find the probability that the sample will contain five symbols of each
kind. Check the result in tables of random numbers,¹⁷ identifying the digits
0 and 1 with a, the digits 2 and 3 with b, etc.


17. If n men, among whom are A and B, stand in a row, what is the probability
that there will be exactly r men between A and B? If they stand in a ring
instead of in a row, show that the probability is independent of r and hence
1/(n - 1). (In the circular arrangement consider only the arc leading from
A to B in the positive direction.)

18. What is the probability that two throws with three dice each will show 
the same configuration if (a) the dice are distinguishable, (b) they are not?


19. Show that it is more probable to get at least one ace with four dice than 
at least one double ace in 24 throws of two dice. (The answer is known as 
de Méré’s paradox. Chevalier de Méré, a gambler, thought that the two 
probabilities ought to be equal and blamed mathematics for his losses.) 


20. From a population of n elements a sample of size r is taken. Find the
probability that none of N prescribed elements will be included in the sample,
assuming the sampling to be (a) without, (b) with replacement. Compare
the numerical values for the two methods when (i) n = 100, r = N = 8, and
(ii) n = 100, r = N = 10.


16 When cells are exposed to harmful radiation, some chromosomes break and
play the role of our "sticks." The "long" side is the one containing the so-called
centromere. If two "long" or two "short" parts unite, the cell dies. See D. G.
Catcheside, The effect of X-ray dosage upon the frequency of induced structural
changes in the chromosomes of Drosophila melanogaster, Journal of Genetics, vol.
36 (1938), pp. 307-320.

17 They are occasionally extraordinarily obliging: see J. A. Greenwood and E. E.
Stuart, Review of Dr. Feller's critique, Journal for Parapsychology, vol. 4 (1940),
pp. 298-319, in particular p. 306.

21. Spread of rumors. In a town of n+ 1 inhabitants, a person tells a 
rumor to a second person, who in turn repeats it to a third person, etc. At 
each step the recipient of the rumor is chosen at random from the n people 
available. Find the probability that the rumor will be told r times without: 
(a) returning to the originator, (b) being repeated to any person. Do the same
problem when at each step the rumor is told to a gathering of N randomly 
chosen people. (The first question is the special case N = 1.) 


22. Chain letters. In a population of n + 1 people a man, the "progenitor,"
sends out letters to two persons, the "first generation." These repeat the
performance and, generally, each member of the rth generation sends out letters
to two persons chosen at random. Find the probability that the generations
number 1, 2, ..., r will not include the progenitor. Find the median of the
distribution, supposing n to be large.


23. A familiar problem. In a certain family four girls take turns at washing 
dishes. Out of a total of four breakages, three were caused by the youngest 
girl, and she was thereafter called clumsy. Was she justified in attributing 
the frequency of her breakages to chance? Discuss the connection with ran- 
dom placements of balls. 


24. What is the probability that (a) the birthdays of twelve people will fall 
in twelve different calendar months (assume equal probabilities for the twelve 
months), (b) the birthdays of six people will fall in exactly two calendar months?


25. Given thirty people, find the probability that among the twelve months 
there are six containing two birthdays and six containing three. 


26. A closet contains n pairs of shoes. If 2r shoes are chosen at random
(with 2r < n), what is the probability that there will be (a) no complete pair, 
(b) exactly one complete pair, (c) exactly two complete pairs among them? 

27. A car is parked among N cars in a row, not at either end. On his return 
the owner finds that exactly r of the N places are still occupied. What is the 
probability that both neighboring places are empty? 


28. A group of 2N boys and 2N girls is divided into two equal groups. Find
the probability p that each group will be equally divided into boys and girls. 
Estimate p, using Stirling’s formula. 

29. In bridge, prove that the probability p of West’s receiving exactly k 
aces is the same as the probability that an arbitrary hand of thirteen cards 
contains exactly k aces. (This is intuitively clear. Note, however, that the 
two probabilities refer to two different experiments, since in the second case 
thirteen cards are chosen at random and in the first case all 52 are distributed.) 

30. The probability that in a bridge game East receives m and South n 
spades is the same as the probability that of two hands of thirteen cards each, 
drawn at random from a deck of bridge cards, the first contains m and the 
second n spades. 

31. What is the probability that the bridge hands of North and South to- 
gether contain exactly k aces, where k = 0, 1, 2, 3, 4? 

32. Let a, b, c, d be four non-negative integers such that a + b + c + d = 13.
Find the probability p(a, b, c, d) that in a bridge game the players North, East,
South, West have a, b, c, d spades, respectively. Formulate a scheme of placing
red and black balls into cells that contains the problem as a special case.

33. Using the result of problem 32, find the probability that some player
receives a, another b, a third c, and the last d spades if (a) a = 5, b = 4, c = 3,
d = 1; (b) a = b = c = 4, d = 1; (c) a = b = 4, c = 3, d = 2.

Note that the three cases are essentially different.

34. Let a, b, c, d be integers with a + b + c + d = 13. Find the probability
q(a, b, c, d) that a hand at bridge will consist of a spades, b hearts, c diamonds,
and d clubs and show that the problem does not reduce to one of placing,
at random, thirteen balls into four cells. Why?

35. Distribution of aces among r bridge cards. Calculate the probabilities
$p_0(r), p_1(r), \ldots, p_4(r)$ that among r bridge cards drawn at random there are
0, 1, ..., 4 aces, respectively. Verify that $p_0(r) = p_4(52 - r)$.

36. Continuation: waiting times. If the cards are drawn one by one, find
the probabilities $f_1(r), \ldots, f_4(r)$ that the first, ..., fourth ace turns up at the
rth trial. Guess at the medians of the waiting times for the first, ..., fourth 
ace and then calculate them. 

37. Find the probability that each of two hands contains exactly k aces if 
the two hands are composed of r bridge cards each, and are drawn (a) from 
the same deck, (b) from two decks. Show that when r = 13 the probability 
in part (a) is the probability that two preassigned bridge players receive exactly 
k aces each. 

38. Misprints. Each page of a book contains N symbols, possibly misprints.
The book contains n = 500 pages and r = 50 misprints. Show that
(a) the probability that pages number 1, 2, ..., n contain, respectively,
$r_1, r_2, \ldots, r_n$ misprints equals


$\binom{N}{r_1} \binom{N}{r_2} \cdots \binom{N}{r_n} \div \binom{nN}{r}$;


(b) for large N this probability may be approximated by (5.5). Conclude that
the r misprints are distributed in the n pages approximately in accordance with a
random distribution of r balls in n cells. (Note. This may be restated as a
general limiting property of Fermi-Dirac statistics. Cf. section 5.)


Note: The following problems refer to the material of section 5. 


39. If $r_1$ indistinguishable things of one kind and $r_2$ indistinguishable things
of a second kind are placed into n cells, find the number of distinguishable
arrangements.

40. If $r_1$ dice and $r_2$ coins are thrown, how many results can be distinguished?

41. In how many different distinguishable ways can $r_1$ white, $r_2$ black, and $r_3$
red balls be arranged?

42. Find the probability that in a random arrangement of 52 bridge cards 
no two aces are adjacent. 

43. Elevator. In the example (3.c) the elevator starts with seven passengers
and stops at ten floors. The various arrangements of discharge may be
denoted by symbols like (3, 2, 2), to be interpreted as the event that three
passengers leave together at a certain floor, two other passengers at another
floor, and the last two at still another floor. Find the probabilities of the fifteen
possible arrangements ranging from (7) to (1, 1, 1, 1, 1, 1, 1).

44. Birthdays. Find the probabilities for the various configurations of the
birthdays of 22 people. 

45. Find the probability for a poker hand to be a (a) royal flush (ten, jack, 
queen, king, ace in a single suit); (b) four of a kind (four cards of equal face 
values); (c) full house (one pair and one triple of cards with equal face values);
(d) straight (five cards in sequence regardless of suit); (e) three of a kind (three 
equal face values plus two extra cards); (f) two pairs (two pairs of equal face 
values plus one other card); (g) one pair (one pair of equal face values plus 
three different cards). 


11. PROBLEMS AND COMPLEMENTS OF A THEORETICAL 
CHARACTER 


1. A population of n elements includes np red ones and nq black ones
(p + q = 1). A random sample of size r is taken with replacement. Show
that the probability of its including exactly k red elements is


(11.1)    $\binom{r}{k} p^k q^{r-k}$.


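2. A limit theorem for the hypergeometric distribution. If n is large and
$n_1/n = p$, then the probability $q_k$ given by (6.1) and (6.2) is close to (11.1).
More precisely,


(11.2)    $\binom{r}{k} \left(p - \frac{k}{n}\right)^k \left(q - \frac{r-k}{n}\right)^{r-k} < q_k < \binom{r}{k} p^k q^{r-k} \left(1 - \frac{r}{n}\right)^{-r}$.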


A comparison of this and the preceding problem shows: For large populations 
there is practically no difference between sampling with or without replacement. 

3. A random sample of size r without replacement is taken from a population
of n elements. The probability $u_r$ that N given elements will all be included
in the sample is


(11.3)    $u_r = \binom{n-N}{r-N} \div \binom{n}{r}$.


(The corresponding formula for sampling with replacement is given by (11.10)
and cannot be derived by a direct argument. For an alternative form of (11.3)
cf. problem IV, 9.)

4. Limiting form. If n → ∞ and r → ∞ so that r/n → p, then $u_r \to p^N$
(cf. problem 13).
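
The closeness of sampling with and without replacement asserted in problem 2 can be examined numerically. The sketch below is an added illustration, not a solution from the text. It assumes that $q_k$ is the usual hypergeometric probability $\binom{n_1}{k}\binom{n-n_1}{r-k} \div \binom{n}{r}$ and compares it with the bounds $\binom{r}{k}(p - k/n)^k (q - (r-k)/n)^{r-k}$ and $\binom{r}{k} p^k q^{r-k} (1 - r/n)^{-r}$; the population sizes are arbitrary.

```python
from fractions import Fraction
from math import comb

def hypergeometric(k, r, n, n1):
    """Probability of k red elements in a sample of r drawn without replacement."""
    return Fraction(comb(n1, k) * comb(n - n1, r - k), comb(n, r))

def binomial(k, r, p):
    """Probability of k red elements in a sample of r drawn with replacement."""
    return comb(r, k) * p ** k * (1 - p) ** (r - k)

def bounds(k, r, n, n1):
    """Lower and upper bounds for the hypergeometric probability."""
    p = Fraction(n1, n)
    q = 1 - p
    lower = comb(r, k) * (p - Fraction(k, n)) ** k * (q - Fraction(r - k, n)) ** (r - k)
    upper = binomial(k, r, p) * (1 - Fraction(r, n)) ** (-r)
    return lower, upper

n, n1, r = 1000, 300, 10   # hypothetical population: 300 red among 1000, sample of 10
for k in range(r + 1):
    low, high = bounds(k, r, n, n1)
    print(k, float(low), float(hypergeometric(k, r, n, n1)), float(high))
```

Exact fractions are used so that the strict inequalities are not blurred by rounding.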


Note: Problems 5-13 refer to the classical occupancy problem (Maxwell-Boltzmann
statistics): That is, r balls are distributed among n cells and each of the $n^r$ possible
distributions has probability $n^{-r}$.¹⁸


18 Problems 5-19 play a role in quantum statistics, the theory of photographic 
plates, G-M counters, etc. The formulas are therefore frequently discussed and 
discovered in the physical literature, usually without a realization of their classical 
and essentially elementary character. Probably all the problems occur (although 
in modified form) in the book by Whitworth quoted at the opening of this chapter. 
5. The probability $p_k$ that a given cell contains exactly k balls is given by
the binomial distribution (4.5). The most probable number is the integer ν
such that $(r - n + 1)/n < \nu \le (r + 1)/n$. (In other words, it is asserted that
$p_0 < p_1 < \cdots < p_{\nu-1} \le p_\nu > p_{\nu+1} > \cdots > p_r$; cf. problem 15.)

6. Limiting form. If n → ∞ and r → ∞ so that the average number
λ = r/n of balls per cell remains constant, then


(11.4)    $p_k \to e^{-\lambda} \lambda^k / k!$.


This is the Poisson distribution, discussed in chapter VI; see problem 16.
7. Let A(r, n) be the number of distributions leaving none of the n cells
empty. Show by a combinatorial argument that


(11.5)    $A(r, n+1) = \sum_{k=1}^{r} \binom{r}{k} A(r-k, n)$.

Conclude that

(11.6)    $A(r, n) = \sum_{\nu=0}^{n} (-1)^\nu \binom{n}{\nu} (n - \nu)^r$.


Hint: Use induction; assume (11.6) to hold and express A(r-k, n) in
(11.5) accordingly. Change the order of summation and use the binomial
formula to express A(r, n+1) as the difference of two simple sums. Replace
in the second sum ν + 1 by a new index of summation and use (8.6).


Note: Formula (11.6) provides a theoretical solution to an old problem but obviously
it would be a thankless task to use it for the calculation of the probability x, say, that in
a village of r = 1900 people every day of the year is a birthday. In chapter IV, section
2, we shall derive (11.6) by another method and obtain a simple approximation formula
(showing, e.g., that x = 0.135, approximately).


8. Show that the number of distributions leaving exactly m cells empty is

(11.7)    $E_m(r, n) = \binom{n}{m} A(r, n-m) = \binom{n}{m} \sum_{\nu=0}^{n-m} (-1)^\nu \binom{n-m}{\nu} (n - m - \nu)^r$.
9. Show without using the preceding results that the probability
$p_m(r, n) = n^{-r} E_m(r, n)$
of finding exactly m cells empty satisfies


(11.8)    $p_m(r+1, n) = p_m(r, n) \frac{n-m}{n} + p_{m+1}(r, n) \frac{m+1}{n}$.

10. Using the results of problems 7 and 8, show by direct calculation that 
(11.8) holds. Show that this method provides a new derivation (by induction
on r) of (11.6). 

11. From (11.6) and problem 8 conclude that the probability of finding m
or more cells empty is


(11.9)    $\binom{n}{m} \sum_{\nu=0}^{n-m} (-1)^\nu \binom{n-m}{\nu} \left(1 - \frac{m+\nu}{n}\right)^r \frac{m}{m+\nu}$.


(For m ≥ n this expression reduces to zero, as is proper.)
12. The probability that each of N given cells is occupied is


(11.10)    $u(r, n) = n^{-r} \sum_{k=0}^{r} \binom{r}{k} A(k, N)(n - N)^{r-k}$.

Conclude that

(11.11)    $u(r, n) = \sum_{\nu=0}^{N} (-1)^\nu \binom{N}{\nu} \left(1 - \frac{\nu}{n}\right)^r$.


(Use the binomial theorem. For N = n we have $u(r, n) = n^{-r} A(r, n)$.
Note that (11.11) is the analogue of (11.3) for sampling with replacement.¹⁹
For an alternative derivation see problem IV, 8.)

13. Limiting form. For the passage to the limit described in problem 4 one
has $u(r, n) \to (1 - e^{-p})^N$.
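
The occupancy formulas of problems 7 and 12 can be checked on small cases and then applied to the numerical examples quoted in the notes. The following Python sketch is an added illustration, not part of the original text. It assumes the inclusion-exclusion forms $A(r, n) = \sum_\nu (-1)^\nu \binom{n}{\nu}(n-\nu)^r$ and $u(r, n) = \sum_\nu (-1)^\nu \binom{N}{\nu}(1 - \nu/n)^r$, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

def A(r, n):
    """Number of placements of r balls into n cells leaving no cell empty."""
    return sum((-1) ** v * comb(n, v) * (n - v) ** r for v in range(n + 1))

def A_brute(r, n):
    """Same count by listing all n^r placements (small r and n only)."""
    return sum(1 for placement in product(range(n), repeat=r) if len(set(placement)) == n)

def u(r, n, N):
    """Probability that each of N given cells is occupied after r balls fall into n cells."""
    return sum((-1) ** v * comb(N, v) * Fraction(n - v, n) ** r for v in range(N + 1))

def waiting_time_median(n):
    """Smallest r for which all n cells are occupied with probability at least 1/2."""
    r = 1
    while u(r, n, n) < Fraction(1, 2):
        r += 1
    return r

print(A(5, 3), A_brute(5, 3))
print(waiting_time_median(10))                                 # complete set of ten digits
print(float(1 - u(50, 10, 10)), float(1 - u(75, 10, 10)))      # waiting time exceeds 50, 75
print(float(Fraction(A(1900, 365), 365 ** 1900)))              # every day of the year a birthday
```

Python integers are unbounded, so the exact evaluation for r = 1900 and n = 365 is feasible here, although it remains a poor substitute for the approximation formula promised in the note to problem 7.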


Note: In problems 14-19 r and n have the same meaning as above, but we assume 
that the balls are indistinguishable and that all distinguishable arrangements have 
equal probabilities (Bose-Einstein statistics). 


14. The probability that a given cell contains exactly k balls is


(11.12)    $q_k = \binom{n+r-k-2}{r-k} \div \binom{n+r-1}{r}$.


15. Show that when n > 2 zero is the most probable number of balls in
any specified cell, or more precisely, $q_0 > q_1 > \cdots$ (cf. problem 5).

16. Limit theorem. Let n → ∞ and r → ∞, so that the average number of
particles per cell, r/n, tends to λ. Then


(11.13)    $q_k \to \frac{\lambda^k}{(1+\lambda)^{k+1}}$.


(The right side is known as the geometric distribution.) 
17. The probability that exactly m cells remain empty is


(11.14)    $\rho_m = \binom{n}{m} \binom{r-1}{n-m-1} \div \binom{n+r-1}{r}$.


19 Note that u(r, n) may be interpreted as the probability that the waiting time
up to the moment when the Nth element joins the sample is less than r. The 
result may be applied to random sampling digits: here u(r, 10) — u(r — 1, 10) is 
the probability that a sequence of r elements must be observed to include the 
complete set of all ten digits. This can be used as a test of randomness. R. E. 
Greenwood (Coupon collector’s test for random digits, Mathematical Tables and 
Other Aids to Computation, vol. 9 (1955), pp. 1-5) has tabulated the distribution and 
compared it to actual counts for the corresponding waiting times for the first 2035 
decimals of π and the first 2486 decimals of e. The median of the waiting time for
a complete set of all ten digits is 27. The probability that this waiting time exceeds 
50 is greater than 0.05, and the probability of the waiting time exceeding 75 is 
about 0.0037. 
18. The probability that a group of m prescribed cells contains a total of
exactly j balls is


(11.15)    $q_j(m) = \binom{m+j-1}{m-1} \binom{n-m+r-j-1}{r-j} \div \binom{n+r-1}{r}$.


19. Limiting form. For the passage to the limit of problem 4 we have

(11.16)    $q_j(m) \to \binom{m+j-1}{m-1} \frac{p^j}{(1+p)^{m+j}}$.

(The right side is a special case of the negative binomial distribution to be intro- 
duced in chapter VI.) 


Theorems on Runs. In problems 20-25 we consider arrangements of $r_1$ alphas
and $r_2$ betas and assume that all arrangements are equally probable [see example (4.d)].
This group of problems refers to section 5a.


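20. The probability that the arrangement contains exactly k runs of either
kind is


(11.17)    $P_k = 2 \binom{r_1-1}{\nu-1} \binom{r_2-1}{\nu-1} \div \binom{r_1+r_2}{r_1}$


when k = 2ν is even, and


(11.18)    $P_k = \left\{ \binom{r_1-1}{\nu} \binom{r_2-1}{\nu-1} + \binom{r_1-1}{\nu-1} \binom{r_2-1}{\nu} \right\} \div \binom{r_1+r_2}{r_1}$


when k = 2ν + 1 is odd.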
21. Continuation. Conclude that the most probable number of runs is an
integer k such that


$\frac{2 r_1 r_2}{r_1 + r_2} < k < \frac{2 r_1 r_2}{r_1 + r_2} + 3$.


(Hint: Consider the ratios $P_{2\nu+2} \div P_{2\nu}$ and $P_{2\nu+1} \div P_{2\nu-1}$.)

22. The probability that the arrangement starts with an alpha run of length
ν ≥ 0 is $(r_1)_\nu r_2 \div (r_1 + r_2)_{\nu+1}$. (Hint: Choose the ν alphas and the beta which
must follow it.) What does the theorem imply for ν = 0?

23. The probability of having exactly k runs of alphas is


(11.19)    $\pi_k = \binom{r_1-1}{k-1} \binom{r_2+1}{k} \div \binom{r_1+r_2}{r_1}$.


Hint: This follows easily from the second part of the lemma of section 5.
Alternatively, equation (11.19) may be derived from (11.17) and (11.18), but
this procedure is more laborious.

24. The probability that the nth alpha is preceded by exactly m betas is


(11.20)    $\binom{r_1+r_2-n-m}{r_2-m} \binom{m+n-1}{m} \div \binom{r_1+r_2}{r_1}$.

25. The probability for the alphas to be arranged in k runs of which $k_1$
are of length 1, $k_2$ of length 2, ..., $k_\nu$ of length ν (with $k_1 + \cdots + k_\nu = k$) is


(11.21)    $\frac{k!}{k_1! k_2! \cdots k_\nu!} \binom{r_2+1}{k} \div \binom{r_1+r_2}{r_1}$.
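
The run formulas of problems 20 and 23 can be compared with a direct enumeration of all arrangements when $r_1$ and $r_2$ are small. The sketch below is an added illustration and not a proof. It assumes the forms $P_{2\nu} = 2\binom{r_1-1}{\nu-1}\binom{r_2-1}{\nu-1} \div \binom{r_1+r_2}{r_1}$, the analogous two-term expression for $P_{2\nu+1}$, and $\pi_k = \binom{r_1-1}{k-1}\binom{r_2+1}{k} \div \binom{r_1+r_2}{r_1}$, with $r_1, r_2 \ge 1$; all names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def c(n, k):
    """Binomial coefficient with the convention that it vanishes for k < 0."""
    return comb(n, k) if k >= 0 else 0

def runs_probability(k, r1, r2):
    """Probability of exactly k runs of either kind among r1 alphas and r2 betas."""
    v = k // 2
    if k % 2 == 0:
        ways = 2 * c(r1 - 1, v - 1) * c(r2 - 1, v - 1)
    else:
        ways = c(r1 - 1, v) * c(r2 - 1, v - 1) + c(r1 - 1, v - 1) * c(r2 - 1, v)
    return Fraction(ways, comb(r1 + r2, r1))

def alpha_runs_probability(k, r1, r2):
    """Probability of exactly k runs of alphas."""
    return Fraction(c(r1 - 1, k - 1) * c(r2 + 1, k), comb(r1 + r2, r1))

def arrangements(r1, r2):
    """All distinguishable arrangements, generated by choosing the alpha positions."""
    n = r1 + r2
    for positions in combinations(range(n), r1):
        chosen = set(positions)
        yield ["a" if i in chosen else "b" for i in range(n)]

def count_runs(seq):
    """Total number of runs: one plus the number of changes of symbol."""
    return 1 + sum(1 for x, y in zip(seq, seq[1:]) if x != y)

def count_alpha_runs(seq):
    """Number of alpha runs: positions where an alpha starts a new run."""
    return sum(1 for i, x in enumerate(seq) if x == "a" and (i == 0 or seq[i - 1] != "a"))

r1, r2 = 4, 3
all_seqs = list(arrangements(r1, r2))
for k in range(1, r1 + r2 + 1):
    observed = Fraction(sum(1 for s in all_seqs if count_runs(s) == k), len(all_seqs))
    print(k, observed, runs_probability(k, r1, r2))
```

The enumeration grows like $\binom{r_1+r_2}{r_1}$, so it serves only as a check for small cases.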
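12. PROBLEMS AND IDENTITIES INVOLVING BINOMIAL
COEFFICIENTS


1. For integral n ≥ 2


(12.1)
$1 - \binom{n}{1} + \binom{n}{2} - + \cdots = 0$,
$\binom{n}{1} + 2\binom{n}{2} + 3\binom{n}{3} + \cdots = n 2^{n-1}$,
$\binom{n}{1} - 2\binom{n}{2} + 3\binom{n}{3} - + \cdots = 0$,
$2 \cdot 1 \binom{n}{2} + 3 \cdot 2 \binom{n}{3} + 4 \cdot 3 \binom{n}{4} + \cdots = n(n-1) 2^{n-2}$.


(Hint: Use the binomial formula.)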
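2. Prove that for positive integers n, k


(12.2)    $\binom{n}{0}\binom{n}{k} - \binom{n}{1}\binom{n-1}{k-1} + \binom{n}{2}\binom{n-2}{k-2} - + \cdots \pm \binom{n}{k}\binom{n-k}{0} = 0$.


More generally²⁰


(12.3)    $\sum_{\nu} \binom{n}{\nu} \binom{n-\nu}{k-\nu} t^\nu = \binom{n}{k} (1+t)^k$.

3. For any a > 0

(12.4)    $\binom{-a}{k} = (-1)^k \binom{a+k-1}{k}$.


If a is an integer, this can be proved also by differentiation of the geometric
series $\sum x^k = (1 - x)^{-1}$.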
4. Prove that


(12.5)    $\binom{2n}{n} 2^{-2n} = (-1)^n \binom{-1/2}{n}$,    $\frac{1}{n} \binom{2n-2}{n-1} 2^{-2n+1} = (-1)^{n-1} \binom{1/2}{n}$.
5. For integral non-negative n and r and all real a

(12.6)    $\sum_{\nu=0}^{n} \binom{a-\nu}{r} = \binom{a+1}{r+1} - \binom{a-n}{r+1}$.


(Hint: Use equation (8.6). The special case n = a is frequently used.)
6. For arbitrary a and integral n ≥ 0


(12.7)    $\sum_{\nu=0}^{n} (-1)^\nu \binom{a}{\nu} = (-1)^n \binom{a-1}{n}$.

[Hint: Use equation (8.6).]


20 The reader is reminded of the convention (8.5): if ν runs through all integers,
only finitely many terms in the sum in (12.3) are different from zero.
7. For positive integers r, k

(12.8)    $\sum_{\nu=0}^{r} \binom{\nu+k-1}{k-1} = \binom{r+k}{k}$.


(a) Prove this using (8.6). (b) Show that (12.8) is a special case of (12.7). (c)
Show by an inductive argument that (12.8) leads to a new proof of the first
part of the lemma of section 5.


8. In section 6 we remarked that the terms of the hypergeometric distribution
should add to unity. This amounts to saying that for any positive integers
a, b, n,


(12.9)    $\binom{a}{0}\binom{b}{n} + \binom{a}{1}\binom{b}{n-1} + \cdots + \binom{a}{n}\binom{b}{0} = \binom{a+b}{n}$.


Prove this by induction. (Hint: Prove first that equation (12.9) holds for
a = 1 and all b.)


9. Continuation. By a comparison of the coefficients of $t^n$ on both sides of

(12.10)    $(1 + t)^a (1 + t)^b = (1 + t)^{a+b}$


prove more generally that (12.9) is true for arbitrary numbers a, b (and
integral n).
10. Using equation (12.9), prove that


(12.11)    $\binom{n}{0}^2 + \binom{n}{1}^2 + \binom{n}{2}^2 + \cdots + \binom{n}{n}^2 = \binom{2n}{n}$.


11. Using equation (12.10), prove that

(12.12)    $\sum_{\nu=0}^{n} \frac{(2n)!}{(\nu!)^2 ((n-\nu)!)^2} = \binom{2n}{n}^2$.


12. Prove that for integers 0 < a < b


(12.13)    $\sum_{k=1}^{a} (-1)^{a-k} \binom{a}{k} \binom{b+k}{b+1} = \binom{b}{a-1}$.


Hint: Using (12.4) show that (12.11) is a special case of (12.9). Alternatively,
compare coefficients of $t^{a-1}$ in $(1 - t)^a (1 - t)^{-b-2} = (1 - t)^{a-b-2}$.
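13. By specialization derive from (12.9) the identities


(12.14)    $\binom{a}{k} - \binom{a}{k-1} + - \cdots \mp \binom{a}{1} \pm 1 = \binom{a-1}{k}$


and

(12.15)    $\sum_{\nu} (-1)^\nu \binom{a}{\nu} \binom{n-\nu}{r} = \binom{n-a}{n-r}$,


valid if k, n, and r are positive integers. [Hint: Use (12.4).]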
14. Using equation (12.9), prove that


(12.16)    $\sum_{k=0}^{n} \binom{a+k-1}{k} \binom{b+n-k-1}{n-k} = \binom{a+b+n-1}{n}$.


(Hint: Apply equation (12.4) back and forth.) Note the important special
cases b = 1, 2.

15. Referring to the problems of section 11, notice that equations (11.12), 
(11.14), (11.15), and (11.16) define probabilities. In each the quantities should
therefore add to unity. Show that this is implied, respectively, by (12.8), 
(12.9), (12.16), and the binomial theorem. 

16. From the definition of A(r, n) in problem 7 of section 11 it follows that
A(r, n) = 0 if r < n and A(n, n) = n!. In other words

(12.17)    $\sum_{k=0}^{n} (-1)^{n-k} \binom{n}{k} k^r = 0$ if r < n, and $= n!$ if r = n.


(a) Prove (12.17) directly by reduction from n to n-1. (b) Next prove
(12.17) by considering the rth derivative of $(1 - e^t)^n$ at t = 0. (c) Generalize
(12.17) by starting from (11.11) instead of (11.6).


17. If 0 < N < n prove by induction that for each integer r > 0

(12.18)    $\sum_{\nu=0}^{N} (-1)^\nu \binom{N}{\nu} (n - \nu)_r = \binom{n-N}{r-N} r!$.


(Note that the right-hand member vanishes when r < N and when r > n.)
Verify (12.18) by considering the rth derivative of $t^{n-N}(t - 1)^N$ at t = 1.


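18. Prove by induction (using the binomial theorem)

(12.19)    $\binom{n}{1} \frac{1}{1} - \binom{n}{2} \frac{1}{2} + \cdots + (-1)^{n-1} \binom{n}{n} \frac{1}{n} = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}$.


Verify (12.19) by integrating the identity $\sum_{\nu=0}^{n-1} (1 - t)^\nu = \{1 - (1 - t)^n\} t^{-1}$.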
19. Show that for any positive integer m


(12.20)    $(x + y + z)^m = \sum \frac{m!}{a! \, b! \, c!} x^a y^b z^c$


where the summation extends over all non-negative integers a, b, c, such that
a + b + c = m.
20. Using Stirling's formula, prove that


(12.21)    $\binom{2n}{n} \sim (\pi n)^{-1/2} 2^{2n}$.


21. Prove that for any positive integers a and b


(12.22)    $\frac{(a+1)(a+2) \cdots (a+n)}{(b+1)(b+2) \cdots (b+n)} \sim \frac{b!}{a!} n^{a-b}$.


22. The gamma function is defined by

(12.23)    $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t} \, dt$


where x > 0. Show that $\Gamma(x) \sim (2\pi)^{1/2} e^{-x} x^{x-1/2}$. (Notice that if x = n is an
integer, $\Gamma(n) = (n-1)!$.)
23. Let a and r be arbitrary positive numbers and n a positive integer.
Show that


(12.24)    $a(a + r)(a + 2r) \cdots (a + nr) \sim C r^{n+1} n^{n + a/r + 1/2} e^{-n}$.


[The constant C is equal to $(2\pi)^{1/2}/\Gamma(a/r)$.]
24. Using the results of the preceding problem, show that


(12.25)    $\frac{a(a + r)(a + 2r) \cdots (a + nr)}{b(b + r)(b + 2r) \cdots (b + nr)} \sim \frac{\Gamma(b/r)}{\Gamma(a/r)} n^{(a-b)/r}$.


25. Prove the following alternative form of Stirling's formula:

(12.26)    $n! \sim (2\pi)^{1/2} (n + \tfrac{1}{2})^{n+1/2} e^{-(n+1/2)}$.

26. Continuation. Using the method of the text, show that

(12.27)    $(2\pi)^{1/2} (n + \tfrac{1}{2})^{n+1/2} e^{-(n+1/2) - 1/(24(n+1/2))} < n! < (2\pi)^{1/2} (n + \tfrac{1}{2})^{n+1/2} e^{-(n+1/2) - 1/(24(n+1))}$.

27. Extending Stirling's formula, prove that


(12.28)    $n! \sim (2\pi)^{1/2} n^{n+1/2} \exp\left(-n + \frac{1}{12n} - \frac{1}{360 n^3} + - \cdots\right)$.


CHAPTER III*


Fluctuations in Coin Tossing and Random Walks


This chapter serves two purposes. First, it will show that exceed- 
ingly simple methods may lead to far-reaching and important results. 
Second, in it we shall for the first time encounter theoretical conclusions 
which not only are unexpected but actually come as a shock to intuition 
and common sense. They will reveal that commonly accepted notions 
concerning chance fluctuations are without foundation and that the 
implications of the law of large numbers are widely misconstrued.¹

The discussion is inserted at this place only because of its elementary 
character; the main topic of the book continues in chapter V. The 
entire book is independent of the present chapter. Some of the formulas 
will reappear later in connection with first passages and recurrence, but 
they will be derived anew by analytical methods. A comparison of 
methods should prove instructive and interesting. Accordingly, the 
present chapter should be read at the reader’s discretion independently of, 
or parallel to, the remainder of the book. To facilitate such a procedure, 
this chapter may be read in two versions: the main text appears in 
ordinary type. Passages in small type cover additional topics (refer- 
ring mainly to first passage and recurrence phenomena) and should be 
omitted at first reading. Section 7 contains an empirical illustration. 


* This chapter may be omitted or read in conjunction with the following chapters. 
Reference to its contents will be made in chapters X (laws of large numbers), XI 
(first-passage times), XIII (recurrent events), XIV (random walks), but the con- 
tents will not be used explicitly in the sequel. 

1 Although we are dealing formally only with coin tossing, the basic conclusions 
are widely applicable. In fact, E. Sparre Andersen has made the surprising dis- 
covery that many facets of the fluctuation theory of sums of independent random 
variables are of a purely combinatorial nature and are common to a huge class of 
such variables. This is true, in particular, of the two arc sine laws. See Mathe- 
matica Scandinavica, vol. 1 (1953), pp. 263-285, and vol. 2 (1954), pp. 195-223. 
1. GENERAL ORIENTATION


A surprising wealth of information concerning chance fluctuations in
general will be derived from the following inconspicuous lemma announced
in 1887 by Bertrand. Similar problems of arrangements have
attracted the interest of students of combinatorial analysis under the
name of ballot problems.² Suppose that, in a ballot, candidate P scores p
votes and candidate Q scores q votes, where p > q. The probability that
throughout the counting there are always more votes for P than for Q equals


(p − q)/(p + q).


In mathematical language we are here concerned with arrangements
of $x = p + q$ symbols $\epsilon_1, \epsilon_2, \ldots, \epsilon_x$ consisting of p plus ones (votes for
P) and q minus ones (votes for Q). The partial sum
$s_k = \epsilon_1 + \epsilon_2 + \cdots + \epsilon_k$ is the number of votes by which P leads, or trails, just after
the kth vote is cast. Clearly $s_x = p - q$ and


(1.1)  $$s_i - s_{i-1} = \epsilon_i = \pm 1, \qquad s_0 = 0 \qquad (i = 1, 2, \ldots, x).$$


Conversely, every arrangement $\{s_1, s_2, \ldots, s_x\}$ of integers satisfying
(1.1) represents a potential voting record. We shall use a geometrical
terminology and represent such an arrangement by a polygonal line
whose ith side has slope $\epsilon_i$ and whose ith vertex has ordinate $s_i$. Such
lines will be called paths.


Definition. Let x > 0 and y be integers. A path $\{s_1, s_2, \ldots, s_x\}$
from the origin to the point (x, y) is a polygonal line whose vertices have
abscissas 0, 1, 2, ..., x and ordinates $s_0, s_1, s_2, \ldots, s_x$ satisfying (1.1)
with $s_x = y$.

If p among the $\epsilon_i$ are positive and q negative, then


(1.2)  $$x = p + q, \qquad y = p - q.$$


An arbitrary point (x, y) can be joined to the origin by a path only if
x and y are of the form (1.2). In this case the p places for the positive


2 For the history and literature see A. Dvoretzky and T. Motzkin, A problem of 
arrangements, Duke Mathematical Journal, vol. 14 (1947), pp. 305-313. As these 
authors point out, most of the formally different proofs in reality use the reflection 
principle (lemma 1 of section 2), but without the geometric interpretation this 
principle loses its simplicity and appears as a curious trick. Dvoretzky and Motzkin 
give a new proof of great simplicity and elegance. They generalize the ballot problem
by requiring that at each instant P have at least α times the votes scored by
Q. This work has been continued by M. T. L. Bizley, Derivation of a new formula
for the number of minimal lattice paths, etc., The Journal of the Institute of Actuaries,
vol. 80, Part 1, No. 354 (1954), pp. 55-62.




$\epsilon_i$ can be chosen from the $x = p + q$ available places in


(1.3)  $$N_{x,y} = \binom{p+q}{p} = \binom{p+q}{q}$$


different ways. It is convenient to define $N_{x,y} = 0$ whenever x, y are
not of the form (1.2). Then there exist exactly $N_{x,y}$ different paths from
the origin to the point (x, y). Bertrand’s ballot theorem asserts that
when y > 0 there exist exactly $(y/x)N_{x,y}$ paths satisfying the conditions
$s_1 > 0, s_2 > 0, \ldots, s_{x-1} > 0, s_x = y$. It will be proved in section 2.


Example. Figure 1 exhibits a path to the point $N_1 = (5, 1)$. There
exist ten such paths of which two satisfy the conditions $s_i > 0$. The
path in the graph is {1, 2, 1, 2, 1}, and the other is {1, 2, 3, 2, 1}.


Figure 1. Illustrating positive paths and the proof of theorem 2 in section 2.
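
The example can be checked by brute force. The following is only an illustrative sketch, not part of the original text: it enumerates all $2^5$ sequences of steps ±1, keeps those ending at ordinate 1, and then filters the paths that stay strictly positive. The helper name `paths_to` is our own.

```python
from itertools import accumulate, product
from math import comb

def paths_to(x, y):
    """Yield every path [s_1, ..., s_x] from the origin that ends at ordinate y."""
    for steps in product((1, -1), repeat=x):
        s = list(accumulate(steps))
        if s[-1] == y:
            yield s

all_paths = list(paths_to(5, 1))
# Paths with s_1 > 0, ..., s_5 > 0, as in the ballot theorem
positive = [s for s in all_paths if all(v > 0 for v in s)]

print(len(all_paths), comb(5, 3))   # number of paths and N_{5,1}
print(positive)
```

The count of positive paths agrees with $(y/x)N_{x,y} = (1/5)\cdot 10 = 2$.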


We can draw the most interesting conclusions from the ballot theorem
if we drop the convention that the terminal point (x, y) of the path
be fixed in advance. There exist $2^n$ different paths from the origin to
points (n, y) with an arbitrary ordinate y. As explained in section 3,
these $2^n$ paths may be taken to represent the $2^n$ possible outcomes of
the ideal experiment consisting in n successive tossings of a perfect
coin. The classical description introduces the fictitious gambler
Peter who at each trial wins or loses a unit amount. The sequence
$\{s_1, s_2, \ldots, s_n\}$ then represents Peter’s successive cumulative gains, that
is, the excess of the accumulated number of heads over tails.

If $s_n = 0$, the net gain at the conclusion of the nth trial is zero: there
exists a tie. Ties occur so infrequently that they do not affect the picture,
but repeated references to them are disturbing. We shall therefore
agree to say that at the nth trial Peter leads if either $s_n > 0$ or
$s_n = 0$ but $s_{n-1} > 0$ (i.e., in case of a tie that player leads who led at
the preceding trial). “Peter leads at the nth trial” is but a description
for “the nth side of the path is above the x-axis.”

The ballot theorem refers to paths situated entirely above the x-axis,
that is, to games in which the lead never changes. This topic may be
pursued further by investigating how often the lead is likely to change
for an arbitrary path. In this connection we reach conclusions that
play havoc with our intuition. It is generally expected that in a prolonged
series of coin tossings Peter should lead about half the time and
Paul the other half. This is entirely wrong, however. In 20,000 tossings
it is about 88 times more probable that Peter leads in all 20,000 trials
than that each player leads in 10,000 trials. In general, the lead changes
at such infrequent intervals that intuition is defied. No matter how 
long the series of tossings, the most probable number of changes of lead 
is zero; exactly one change of lead is more probable than two, two 
changes are more probable than three, etc. In short, if a modern 
educator or psychologist were to describe the long-run case histories 
of individual coin-tossing games, he would classify the majority of 
coins as maladjusted. If many coins are tossed n times each, a sur- 
prisingly large proportion of them will leave one player in the lead 
almost all the time; and in very few cases will the lead change sides 
and fluctuate in the manner that is generally expected of a well-behaved 
coin. 
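
The factor 88 quoted above can be reproduced numerically. The sketch below assumes formula (5.1) of section 5, $p_{2k,2n} = u_{2k}u_{2n-2k}$ with $u_{2n} = \binom{2n}{n}2^{-2n}$, and works with logarithms (via the log-gamma function) to avoid very large integers. The function names are our own.

```python
from math import exp, lgamma, log

def log_u(two_n):
    """Logarithm of u_{2n} = C(2n, n) 2^{-2n}, computed through log-gamma."""
    n = two_n // 2
    return lgamma(2 * n + 1) - 2 * lgamma(n + 1) - 2 * n * log(2)

def log_p(two_k, two_n):
    """Logarithm of p_{2k,2n} = u_{2k} u_{2n-2k}, formula (5.1) of section 5."""
    return log_u(two_k) + log_u(two_n - two_k)

# Peter leads in all 20,000 trials versus each player leading in 10,000 trials
ratio = exp(log_p(20000, 20000) - log_p(10000, 20000))
print(round(ratio, 1))
```

The printed ratio lies between 88 and 89, in agreement with the statement in the text.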

This is a sample of the conclusions to be drawn from the first arc 
sine law (see section 5 and the illustration in section 7). E. Sparre
Andersen has shown that this law has a wide field of applicability, and 
the situation here described for coin tossings is typical for chance fluc- 
tuations involving cumulative effects. Most stochastic processes in 
physics, economics, and education are of this nature, and our findings 
should serve as a warning to those who are prone to discern secular 
trends and deviations from average norms. 


The same situation may be viewed from a somewhat different angle. If the coin 
tossing proceeds at a uniform rate, common sense expects that, with due allowance 
for chance fluctuations, a two-day game should produce twice as many ties as a 
one-day game. In other words, we expect intuitively the number of ties to increase 
roughly in proportion to the duration of the game. Paradoxically this is not so: 
The number of ties increases about as the square root of time. In 10,000 tossings the 
median number of ties is 67, but in 1,000,000 tossings it increases only to 674; the 
typical ‘‘wavelength”’ increases from about 150 to about 1500. The average wave- 
length increases with time (sections 6 and 8). The formulas on which these conclu- 
sions are based play an important role for first passage and recurrence times in 
general random walks and diffusion theory. 

Theorem 3 of section 2 stands apart from the remainder and is not used elsewhere. 
It concerns a variant of the ballot problem for the case where the two candidates 
score the same number, n, of votes. Then P leads an even number, 2k, of times and 
Q leads in the remaining 2n — 2k trials. Again we have the false intuition that 
each candidate is likely to lead about half the time, that is, we expect 2k to be 
close to n. Actually, if the ballot ended in a tie n:n, the n + 1 possible divisions of
leads (namely 2n:0, 2n−2:2, 2n−4:4, ..., 2:2n−2, 0:2n) have the same probability
$(n+1)^{-1}$. This result stands in a marked contrast to the situation described
above where the end result was not prescribed in advance; there the extreme divi- 
sions 2n:0 and 0:2n are most probable. 

It has been pointed out by J. L. Hodges³ that this theorem has statistical applications
to rank-order tests. We illustrate this point by the


Example. Suppose that a quantity (e.g. the height of plants) is measured on
each of n treated subjects and also on each of n control subjects, obtaining measurements
$a_1, \ldots, a_n$ and $b_1, \ldots, b_n$. To fix ideas, suppose that each group is arranged
in decreasing order: $a_1 > a_2 > \cdots$ and $b_1 > b_2 > \cdots$. Let us combine the two
sequences and write the 2n letters $a_1, \ldots, b_n$ in decreasing order. The resulting
arrangement of n letters a and n letters b may be interpreted as the record of a
ballot in which each candidate received n votes. For an extremely successful treatment
all the a’s should precede the b’s; a completely ineffectual treatment should
produce a random order. In our arrangement the a’s lead exactly 2k times if k
different a’s precede the b’s of same rank, that is, if the inequality $a_i > b_i$ holds for
exactly k subscripts. Assuming randomness, the probability that this happens
equals 1/(n + 1) and therefore the probability that the a’s lead 2k times or more
is (n − k + 1)/(n + 1). The classical example for this argument (used qualitatively
without knowledge of the theoretical probabilities) is due to Galton who used
it in 1876 for data referred to him by Charles Darwin. In his example 2n was 30
and the a’s were in the lead 26 times. Galton concluded that the treatment was
efficient, but on the hypothesis of mere randomness even an ineffectual treatment
would produce 26 or more leads in three out of sixteen similar experiments. This
shows that a quantitative analysis may be a valuable supplement to our rather shaky
intuition. (For related tests based on the theory of runs see chapter II, section 5a.)
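
The arithmetic behind “three out of sixteen” is easy to reproduce with exact fractions. This is a minimal sketch; the function name `prob_leads_at_least` is hypothetical, and the formula is the one stated in the example.

```python
from fractions import Fraction

def prob_leads_at_least(n, k):
    """P{the a's lead 2k times or more} = (n - k + 1)/(n + 1), assuming randomness."""
    return Fraction(n - k + 1, n + 1)

# Galton's data as quoted in the text: 2n = 30 and 26 leads, hence n = 15 and k = 13
p = prob_leads_at_least(15, 13)
print(p)   # 3/16
```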


2. PROBLEMS OF ARRANGEMENTS 


Let A = (a, α) and B = (b, β) be integral points in the positive
quadrant: b > a ≥ 0, α > 0, β > 0. By reflection of A on the x-axis is
meant the point A′ = (a, −α). (See figure 2.) A path from A to B is
defined as in section 1, with A playing the role of the origin.


Figure 2. Illustrating the reflection principle.


3 Galton’s rank-order test, Biometrika, vol. 42 (1955), pp. 261-262. 
Lemma.⁴ (Reflection principle.) The number of paths from A to B
which touch or cross the x-axis equals the number of all paths from A′ to B.


Proof. Consider a path $\{s_a = \alpha, s_{a+1}, \ldots, s_b = \beta\}$ from A to B
having one or more vertices on the x-axis. Let t be the abscissa of the
first such vertex (see figure 2); that is, choose t so that $s_a > 0, \ldots, s_{t-1} > 0$,
$s_t = 0$. Then $\{-s_a, -s_{a+1}, \ldots, -s_{t-1}, s_t = 0, s_{t+1}, s_{t+2}, \ldots, s_b\}$
is a path leading from A′ to B and having T = (t, 0) as its first
vertex on the x-axis. The sections AT and A′T being reflections of
each other, there exists a one-to-one correspondence between all paths
from A′ to B and such paths from A to B as have a vertex on the
x-axis. The lemma is proved.
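
The lemma lends itself to a direct numerical check on small cases. The sketch below is an illustration only: the sample points A = (0, 2) and B = (8, 4) are chosen arbitrarily, and `count_paths` is our own helper. It counts, by exhaustive enumeration, the paths from A to B having a vertex on or below the x-axis and compares the result with the number of all paths from A′ to B.

```python
from itertools import accumulate, product
from math import comb

def count_paths(a, alpha, b, beta, touching_only=False):
    """Count paths from (a, alpha) to (b, beta); if touching_only is set,
    keep only paths with at least one vertex on or below the x-axis."""
    total = 0
    for steps in product((1, -1), repeat=b - a):
        s = [alpha + v for v in accumulate(steps)]   # ordinates after each step
        if s[-1] != beta:
            continue
        if touching_only and min(s) > 0:
            continue
        total += 1
    return total

a, alpha, b, beta = 0, 2, 8, 4          # hypothetical sample points A and B
touching = count_paths(a, alpha, b, beta, touching_only=True)
reflected = count_paths(a, -alpha, b, beta)
print(touching, reflected)
```

Both counts equal $N_{8,6} = \binom{8}{7}$, the number of paths of 8 sides that rise by 6 units.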


Theorem 1. (Ballot theorem.) Let x > 0, y > 0; the number of paths
$\{s_1, s_2, \ldots, s_x = y\}$ from the origin to (x, y) such that
$s_1 > 0, s_2 > 0, \ldots, s_x > 0$ equals $(y/x)N_{x,y}$.

Proof. Since $s_1 = \pm 1$, we have $s_1 = 1$ for each admissible path.
It follows that there exist as many admissible paths as there are paths
leading from the point (1, 1) to (x, y) which neither touch nor cross
the x-axis. By the last lemma the number of such paths equals


(2.1)  $$N_{x-1,y-1} - N_{x-1,y+1} = \binom{p+q-1}{p-1} - \binom{p+q-1}{q-1} = \frac{p-q}{p+q}\binom{p+q}{p} = \frac{y}{x}\,N_{x,y}.$$

The Duality Principle. Almost every theorem on paths can be reformulated
to obtain a formally different theorem. Consider $\{s_1, \ldots, s_x\}$ and the path obtained
from it by reversing the order of the $\epsilon_i$, that is, the path $\{s_1^*, s_2^*, \ldots, s_x^*\}$
where $s_1^* = \epsilon_x$, $s_2^* = \epsilon_x + \epsilon_{x-1}$, $s_3^* = \epsilon_x + \epsilon_{x-1} + \epsilon_{x-2}$, ..., or


(2.2)  $$s_i^* = s_x - s_{x-i}.$$


The two paths (1.1) and (2.2) are congruent and are obtained from each other by
a rotation through 180 degrees; they join the same endpoints. To each theorem on
paths there corresponds a dual theorem obtained by applying it to the reversed path (2.2).


For example, the ballot theorem gives us the number of reversed paths $\{s_1^*, \ldots, s_x^*\}$
joining the origin to (x, y) such that $s_i^* > 0$ for i = 1, 2, ..., x. But this is


4 The probability literature attributes this method to D. André (1887). The text 
reduces it to a lemma on random walks. The classical difference equations of ran- 
dom walks (chapter XIV) closely resemble differential equations, and the reflection 
principle (even a stronger form of it) is familiar in that theory under the name of 
Lord Kelvin’s method of images. 
the same as $s_x > s_{x-i}$ for i = 1, 2, ..., x−1 and hence we have as an alternative
form of the ballot theorem


Theorem 1*. The number of paths $\{s_1, s_2, \ldots, s_x\}$ from (0, 0) to (x, y) such that
$s_1 < s_x, s_2 < s_x, \ldots, s_{x-1} < s_x$ (where $s_x = y > 0$) equals $(y/x)N_{x,y}$.


Geometrically speaking, theorem 1 is concerned with paths whose left endpoint 
is the lowest vertex, whereas the dual theorem 1* refers to paths whose last vertex 
is highest. (See figure 3.) Theorem 1* has implications for first-passage times in 
random walks. 
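
The duality between theorem 1 and theorem 1* can be made concrete on a small case. The sketch below (helper names are ours; the endpoint (7, 3) is an arbitrary choice) reverses every positive path and checks that the reversed paths are exactly the paths whose last vertex is the unique highest one.

```python
from itertools import accumulate, product

def reversed_path(s):
    """Dual path with vertices s*_i = s_x - s_{x-i}, where s_0 = 0."""
    full = [0] + list(s)
    x = len(s)
    return [full[x] - full[x - i] for i in range(1, x + 1)]

def all_paths(x, y):
    """Yield every path [s_1, ..., s_x] from the origin to (x, y)."""
    for steps in product((1, -1), repeat=x):
        s = list(accumulate(steps))
        if s[-1] == y:
            yield s

x, y = 7, 3
positive = [s for s in all_paths(x, y) if min(s) > 0]             # theorem 1
last_highest = [s for s in all_paths(x, y) if max(s[:-1]) < y]    # theorem 1*
duals = sorted(reversed_path(s) for s in positive)
print(len(positive), len(last_highest))
```

Both lists contain $(3/7)\binom{7}{5} = 9$ paths, and reversing the first list reproduces the second.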


Figure 3. Illustrating first passages and returns to the origin.

We turn to a study of paths joining the origin to a point N = (2n, 0)
of the x-axis (an odd vertex on the x-axis is impossible). Put for
abbreviation


(2.3)  $$L_{2n} = \frac{1}{n+1}\binom{2n}{n}.$$


Theorem 2. Among the $\binom{2n}{n}$ paths joining the origin to the point 2n
of the x-axis there are
(a) exactly $L_{2n-2}$ paths such that

(2.4)  $$s_1 > 0,\ s_2 > 0,\ \ldots,\ s_{2n-1} > 0, \qquad (s_{2n} = 0)$$

(b) exactly $L_{2n}$ paths such that

(2.5)  $$s_1 \ge 0,\ s_2 \ge 0,\ \ldots,\ s_{2n-1} \ge 0, \qquad (s_{2n} = 0).$$


(That is, there are as many paths to 2n with all inner vertices above the
x-axis as there are paths to 2n − 2 with no vertex below the x-axis.)


Proof. (See figure 1.) Each path satisfying condition (2.4) passes
through the point $N_1 = (2n-1, 1)$ and by theorem 1 the number of
paths to $N_1$ such that $s_1 > 0, \ldots, s_{2n-2} > 0$ equals


(2.6)  $$\frac{1}{2n-1}\binom{2n-1}{n-1} = \frac{1}{n}\binom{2n-2}{n-1} = L_{2n-2}.$$

This proves (a). Again, let a path satisfy condition (2.4). Omitting
the first and the last side we get a path that joins the point $O_1 = (1, 1)$
to $N_1 = (2n-1, 1)$ and at the same time is such that all its vertices
lie on or above the line y = 1. Translating the origin to $O_1$, we get a
path from the new origin to the point $N_1$ (which has the new coordinates
2n − 2 and 0), none of whose vertices lies below the new x-axis.


We have thus established a one-to-one correspondence between such
paths and all paths satisfying (2.4), and the theorem is proved.
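
For small n both parts of theorem 2 can be verified by enumerating all paths of length 2n. This is a sketch under our own naming (`L`, `count_returning_paths`); it is a check, not a proof.

```python
from itertools import accumulate, product
from math import comb

def L(two_n):
    """L_{2n} of (2.3)."""
    n = two_n // 2
    return comb(2 * n, n) // (n + 1)

def count_returning_paths(two_n, strict):
    """Count paths with s_{2n} = 0 whose inner vertices are > 0 (strict) or >= 0."""
    count = 0
    for steps in product((1, -1), repeat=two_n):
        s = list(accumulate(steps))
        if s[-1] != 0:
            continue
        inner = s[:-1]
        if strict:
            ok = all(v > 0 for v in inner)
        else:
            ok = all(v >= 0 for v in inner)
        count += ok
    return count

for n in range(1, 7):
    assert count_returning_paths(2 * n, strict=True) == L(2 * n - 2)    # part (a)
    assert count_returning_paths(2 * n, strict=False) == L(2 * n)       # part (b)
```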


As explained in section 1 the following theorem stands apart from the remainder 
and will not be used in the sequel. 


Theorem 3.⁵ Let $L_{2k,2n}$ be the number of paths from the origin to the point 2n of
the x-axis such that 2k of its sides lie above the x-axis and 2n − 2k below (k = 0, 1, ...,
n). Then $L_{2k,2n} = L_{2n}$, independently of k.


Proof. The assertion is trivially true for n = 1 and we assume by induction that
$L_{2k,2\nu} = L_{2\nu}$ for ν = 1, 2, ..., n−1 and 0 ≤ k ≤ ν. We propose to count the number
of paths $\{s_1, s_2, \ldots, s_{2n} = 0\}$ with exactly 2k sides above the x-axis. First
assume 1 ≤ k ≤ n−1. Such a path crosses the x-axis and we denote by 2r the
abscissa of its first vertex on the x-axis. We have then to consider two classes of
paths.

A path of the first class is positive between 0 and 2r, and its section between 2r
and 2n contains exactly 2k − 2r sides above the axis. Here k ≥ r. By theorem
2(a) there exist $L_{2r-2}$ paths $\{s_1, \ldots, s_{2r-1}, s_{2r} = 0\}$ with $s_1 > 0, \ldots, s_{2r-1} > 0$, and
by the induction hypothesis there exist $L_{2k-2r,2n-2r} = L_{2n-2r}$ paths joining (2r, 0)
to (2n, 0) and having 2k − 2r sides above the x-axis. Accordingly, there exists a
total of $L_{2r-2}L_{2n-2r}$ paths of this class.

A path of the second class is negative between 0 and 2r; its section between 2r
and 2n then contains 2k sides above the x-axis. By the argument above there
exist again $L_{2r-2}L_{2n-2r}$ paths of this class, but this time n − r ≥ k.

It follows that for k = 1, ..., n−1


(2.7)  $$L_{2k,2n} = \sum_{r=1}^{k} L_{2r-2}L_{2n-2r} + \sum_{r=1}^{n-k} L_{2r-2}L_{2n-2r}.$$


By changing the summation index to ρ = n − r + 1, the terms of the second


5 First proved by complicated analytical methods by K. L. Chung and W. Feller, 
Fluctuations in coin tossing, Proceedings National Academy of Sciences USA, vol. 
35 (1949), pp. 605-608 (see also the first edition of the present book, chapter XII,
problem 4). An elegant combinatorial proof was given by J. L. Hodges (see foot- 
note 3). 
sum become $L_{2r-2}L_{2n-2r} = L_{2n-2\rho}L_{2\rho-2}$ with ρ running from k + 1 to n. Thus


(2.8)  $$L_{2k,2n} = \sum_{\rho=1}^{n} L_{2\rho-2}L_{2n-2\rho},$$


which is independent of k.

A path with all 2n sides above the x-axis is a path of the sort described in theorem
2(b), and hence $L_{2n,2n} = L_{2n}$. For reasons of symmetry we have also $L_{0,2n} = L_{2n}$.
The total number of paths from the origin to (2n, 0) being $(n+1)L_{2n}$, it follows
that $L_{2k,2n} = L_{2n}$ for k = 0, 1, ..., n.


As a corollary we find the identity

(2.9)  $$L_{2n} = \sum_{\rho=1}^{n} L_{2\rho-2}L_{2n-2\rho}.$$

For a direct analytic verification see section 8(a).
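
Theorem 3 and the identity (2.9) can both be checked by enumeration for a small n. The sketch below uses n = 5 (an arbitrary choice) and our own helper names: it tallies the paths ending at (2n, 0) by the number of sides above the x-axis, where a side counts as above the axis when at least one of its endpoints is positive.

```python
from collections import Counter
from itertools import accumulate, product
from math import comb

def catalan_L(two_n):
    """L_{2n} of (2.3)."""
    n = two_n // 2
    return comb(2 * n, n) // (n + 1)

def sides_above(s):
    """Number of sides above the x-axis for the path with vertices s_1, ..., s_m."""
    prev, above = 0, 0
    for v in s:
        if v > 0 or prev > 0:
            above += 1
        prev = v
    return above

def distribution(two_n):
    """Tally paths from the origin to (2n, 0) by their number of sides above the axis."""
    tally = Counter()
    for steps in product((1, -1), repeat=two_n):
        s = list(accumulate(steps))
        if s[-1] == 0:
            tally[sides_above(s)] += 1
    return tally

n = 5
tally = distribution(2 * n)
convolution = sum(catalan_L(2 * r - 2) * catalan_L(2 * n - 2 * r) for r in range(1, n + 1))
print(sorted(tally.items()))
```

Each of the n + 1 possible values 0, 2, ..., 2n occurs for the same number $L_{2n}$ of paths.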


3. RANDOM WALKS AND COIN TOSSING 


In a sequence of N tossings of an ideal coin let $\epsilon_k = +1$ if the kth
trial results in heads and $\epsilon_k = -1$ otherwise. Then
$s_k = \epsilon_1 + \epsilon_2 + \cdots + \epsilon_k$ is the cumulative excess of heads over tails at the conclusion
of the kth trial. In classical betting language $s_k$ is “Peter’s accumulated
net gain.” Each possible outcome of the N successive tossings
is represented by a path of N sides starting at the origin, and conversely
each such path may be taken as representing the outcome of N tossings.

This consideration leads us to take for our sample space the aggregate
of the $2^N$ paths $\{s_1, \ldots, s_N\}$ starting at the origin and to attribute probability
$2^{-N}$ to each.

An event such as “heads at the first two trials” must be interpreted
as the aggregate of all sequences starting with $s_1 = 1$, $s_2 = 2$. There
are $2^{N-2}$ such sequences and the probability of this event is therefore
$2^{-2}$, as is proper. More generally, if k < N there exist exactly $2^{N-k}$
different paths $\{s_1, s_2, \ldots, s_N\}$ such that their first k vertices lie on a
preassigned path $\{s_1, s_2, \ldots, s_k\}$. It follows that an event determined
by the outcome of the first k < N trials has a probability independent of
N. In practice, therefore, the number N plays no role, provided it is
sufficiently large. Conceptually and formally it is best to consider each
finite sequence of tossings as the beginning of a potentially infinite sequence,
but this would lead us into non-denumerable sample spaces.
We shall therefore consider finite sequences with N larger than the
number of trials occurring in the formulas; except for this we shall be
permitted, and be glad, to forget about N.
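
That the probability of such an event does not depend on N can be seen on the event “heads at the first two trials.” The sketch below (the function name `prob_prefix` is ours) computes its probability in the sample spaces of all paths of length N = 2, ..., 8.

```python
from fractions import Fraction
from itertools import accumulate, product

def prob_prefix(prefix, N):
    """Probability that the first len(prefix) vertices follow `prefix`,
    in the sample space of all 2^N equally likely paths of length N."""
    k = len(prefix)
    hits = sum(
        1
        for steps in product((1, -1), repeat=N)
        if list(accumulate(steps))[:k] == list(prefix)
    )
    return Fraction(hits, 2 ** N)

# s_1 = 1, s_2 = 2 means heads at the first two trials
probs = [prob_prefix((1, 2), N) for N in range(2, 9)]
print(probs)
```

Every entry of the list equals 1/4, whatever the value of N.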

For the probabilistic background and the connection with related
topics it is desirable to supplement the geometric language by an alternative
terminology. We imagine the coin tossings performed at a uniform
rate, so that the nth trial occurs at time n. Peter may mark his
cumulative gain at all times by an indicator which we shall call “particle.”
This particle, then, moves on a vertical axis starting from the
origin. It moves at times 1, 2, ... one unit step upward if the coin
lands heads, one unit step downward if the coin lands tails. We say
that the particle performs a symmetric random walk. (The physicist
takes it as the simplest model for one-dimensional diffusion; see chapter XIV.)

At time n the position of the particle is the point $s_n$ of the vertical
axis. The path $\{s_1, s_2, \ldots, s_n\}$ represents the space-time diagram of the
random walk, the x-axis playing the role of the time axis.

Guided by this background we introduce the following 


Terminology. We shall say that at time n there takes place:
A return to the origin if $s_n = 0$.
A first return to the origin if


(3.1)  $$s_1 \neq 0,\ s_2 \neq 0,\ \ldots,\ s_{n-1} \neq 0,\ s_n = 0.$$

A first passage through r > 0 if

(3.2)  $$s_1 < r,\ s_2 < r,\ \ldots,\ s_{n-1} < r,\ s_n = r.$$


A second, third, ... return to the origin and a first passage through 
r < 0 are defined in an obvious way. Note that passages through the 
origin can take place only at even times, and we shall frequently restrict 
the formulas to even times. In betting language a return to the origin 
represents an equalization of the accumulated numbers of heads and tails. 
(Figure 3 exhibits two paths in which the first passages and returns to 
the origin, respectively, are marked; the second path has the peculiarity 
of keeping to the negative side.) 
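
The terminology translates directly into small functions on a path given as the list of its vertices $s_1, s_2, \ldots$. This is only a sketch; the function names and the sample path are hypothetical.

```python
def return_times(s):
    """Times n (counted from 1) at which s_n = 0."""
    return [n for n, v in enumerate(s, start=1) if v == 0]

def first_return_time(s):
    """Time of the first return to the origin, or None if the path never returns."""
    times = return_times(s)
    return times[0] if times else None

def first_passage_time(s, r):
    """Time of the first passage through r > 0 as in (3.2), or None.
    Since the steps are +1 or -1, the first time with s_n = r is the first passage."""
    for n, v in enumerate(s, start=1):
        if v == r:
            return n
    return None

path = [1, 0, -1, 0, 1, 2, 1, 2, 3]   # hypothetical sample path
print(return_times(path), first_return_time(path), first_passage_time(path, 3))
```

As noted in the text, the returns to the origin in the sample path occur at even times only.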


4. REFORMULATION OF THE COMBINATORIAL THEOREMS


In the following sections we shall use the notations


(4.1)  $$u_{2n} = \binom{2n}{n}\, 2^{-2n}$$

and

(4.2)  $$f_0 = 0, \qquad f_{2n} = \frac{1}{2n}\,u_{2n-2}, \qquad n = 1, 2, \ldots.$$


It is easily verified that

(4.3)  $$f_{2n} = u_{2n-2} - u_{2n}, \qquad n = 1, 2, \ldots.$$
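
The verification of (4.3) amounts to the relation $u_{2n} = u_{2n-2}\,(2n-1)/(2n)$. It can also be confirmed with exact rational arithmetic, as in the following sketch (the function names `u` and `f` simply mirror the notation of the text).

```python
from fractions import Fraction
from math import comb

def u(two_n):
    """u_{2n} = C(2n, n) 2^{-2n}, formula (4.1)."""
    n = two_n // 2
    return Fraction(comb(2 * n, n), 4 ** n)

def f(two_n):
    """f_{2n} = u_{2n-2} / (2n), formula (4.2), with f_0 = 0."""
    if two_n == 0:
        return Fraction(0)
    return u(two_n - 2) / two_n

for n in range(1, 50):
    assert f(2 * n) == u(2 * n - 2) - u(2 * n)   # relation (4.3)

print(u(2), u(4), f(2), f(4))
```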
Theorem 1. For each n ≥ 1:

(4.4)  $$u_{2n} = P\{s_{2n} = 0\}$$

(4.5)  $$u_{2n} = P\{s_1 \neq 0,\ s_2 \neq 0,\ \ldots,\ s_{2n} \neq 0\}$$

(4.6)  $$u_{2n} = P\{s_1 \ge 0,\ s_2 \ge 0,\ \ldots,\ s_{2n} \ge 0\}$$


or in words: The three events, (a) a return to the origin takes place at time
2n, (b) no return occurs up to and including time 2n, and (c) the path
is non-negative between 0 and 2n, have the common probability $u_{2n}$.
Furthermore,


(4.7)  $$f_{2n} = P\{s_1 \neq 0,\ s_2 \neq 0,\ \ldots,\ s_{2n-1} \neq 0,\ s_{2n} = 0\}$$

(4.8)  $$f_{2n} = P\{s_1 \ge 0,\ s_2 \ge 0,\ \ldots,\ s_{2n-2} \ge 0,\ s_{2n-1} < 0\}$$


that is: the two events (a) the first return to the origin takes place at time
2n, and (b) the first passage through −1 occurs at time 2n − 1, have the
common probability $f_{2n}$.


Proof. As was observed in section 3 it suffices to consider the sample
space of paths of the fixed length 2n. By (1.3) there exist $\binom{2n}{n}$ paths
joining the origin to the point (2n, 0), and this proves (4.4).

By theorem 2(a) in section 2 there exist $L_{2n-2}$ paths joining the
origin to (2n, 0) such that $s_1 > 0, \ldots, s_{2n-1} > 0$. Therefore there are
twice as many paths satisfying the condition in (4.7), and the corresponding
probability is $2L_{2n-2} \cdot 2^{-2n} = f_{2n}$. Theorem 2(b) in section 2
implies in the same way (4.8).

The probability that no zero occurs up to and including time 2n
equals one minus the probability of a first return to the origin at a
time ≤ 2n. Using (4.7) this difference is


(4.9)  $$1 - f_2 - f_4 - \cdots - f_{2n} = 1 - (1 - u_2) - (u_2 - u_4) - \cdots - (u_{2n-2} - u_{2n}) = u_{2n},$$


which proves (4.5). Similarly, the right side in (4.6) equals one minus
the probability of a first passage through −1 before time 2n, and using
(4.8) this difference is again given by (4.9). This accomplishes the
proof.
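
All five statements of the theorem can be confirmed for small n by enumerating the $2^{2n}$ paths of length 2n. The sketch below is an illustration with our own helper `prob`; each event is written as a predicate on the list $[s_1, \ldots, s_{2n}]$.

```python
from fractions import Fraction
from itertools import accumulate, product
from math import comb

def u(two_n):
    """u_{2n} = C(2n, n) 2^{-2n}."""
    n = two_n // 2
    return Fraction(comb(2 * n, n), 4 ** n)

def prob(two_n, event):
    """Probability of `event` over all 2^{2n} equally likely paths of length 2n."""
    hits = sum(
        1
        for steps in product((1, -1), repeat=two_n)
        if event(list(accumulate(steps)))
    )
    return Fraction(hits, 2 ** two_n)

for n in range(1, 6):
    m = 2 * n
    f_m = u(m - 2) / m
    assert prob(m, lambda s: s[-1] == 0) == u(m)                                   # (4.4)
    assert prob(m, lambda s: all(v != 0 for v in s)) == u(m)                       # (4.5)
    assert prob(m, lambda s: all(v >= 0 for v in s)) == u(m)                       # (4.6)
    assert prob(m, lambda s: s[-1] == 0 and all(v != 0 for v in s[:-1])) == f_m    # (4.7)
    assert prob(m, lambda s: s[-2] < 0 and all(v >= 0 for v in s[:-2])) == f_m     # (4.8)
```

In the last line `s[-2]` is $s_{2n-1}$; the final step is left free, which does not change the probability.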


Corollary. It follows that for n ≥ 1


(4.10)  $$u_{2n} = \sum_{r=1}^{n} f_{2r}\,u_{2n-2r}.$$


Proof. If a return to the origin takes place at time 2n, then the first
return must take place at some time 2r ≤ 2n. We have just seen that
the number of paths from the origin to (2n, 0) with the first return to
the origin taking place at time 2r ≤ 2n equals $2^{2r} f_{2r} \cdot 2^{2n-2r} u_{2n-2r}$.
Summing over r, we get equation (4.10). (For a direct analytic proof
see section 8(a). In chapter XIII, section 3, we shall see that (4.10) is
a special case of the basic equation for recurrent events.)
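
The convolution identity (4.10) is easily confirmed in exact arithmetic. A minimal sketch, with function names following the notation of the text and `convolution` as our own helper:

```python
from fractions import Fraction
from math import comb

def u(two_n):
    """u_{2n} = C(2n, n) 2^{-2n}."""
    n = two_n // 2
    return Fraction(comb(2 * n, n), 4 ** n)

def f(two_n):
    """f_{2n} = u_{2n-2} / (2n), with f_0 = 0."""
    return u(two_n - 2) / two_n if two_n > 0 else Fraction(0)

def convolution(two_n):
    """Right-hand side of (4.10): sum of f_{2r} u_{2n-2r} for r = 1, ..., n."""
    n = two_n // 2
    return sum(f(2 * r) * u(2 * n - 2 * r) for r in range(1, n + 1))

checks = [convolution(2 * n) == u(2 * n) for n in range(1, 40)]
print(all(checks))
```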


Theorem 1* in section 2 enumerates the paths in which a first passage through
y occurs at time x. The sum x + y must be even, and for our purposes it is convenient
to put x = 2n − y. The content of theorem 1* may then be restated as
follows.


Theorem 2. The probability that a first passage through y > 0 takes place at
time 2n − y is given by


(4.11)  $$f_{2n}^{(y)} = \frac{y}{2n-y}\binom{2n-y}{n}\, 2^{-2n+y}, \qquad n \ge y > 0.$$


The simplicity with which the duality principle delivered this important formula
as a direct consequence of the ballot theorem is truly remarkable. A direct analytic
derivation of (4.11) is difficult and requires special tricks.

In principle, the probabilities $f_{2n}^{(y)}$ can be calculated by induction on y. A path
of length 2n − y − 1 in which a first passage through y + 1 occurs at the terminal
point may be decomposed into two segments (see figure 3 for y = 4). The first
segment is the path from the origin up to the point of the first passage through y;
it occurs at some time 2ν − y < 2n − y − 1. This section is followed by the second,
a section of length 2n − 2ν − 1 in which the terminal endpoint is the only
one lying above the left endpoint. In other words, if its left endpoint is taken as
the origin, the second section represents a path with a first passage through 1 at
the endpoint. By definition there exist $2^{2\nu-y} f_{2\nu}^{(y)}$ sections of the first type and
$2^{2n-2\nu-1} f_{2n-2\nu}^{(1)}$ of the second, and any two can be combined to give a path with
first passage through y + 1 at time 2n − y − 1. Therefore


(4.12)  $$f_{2n}^{(y+1)} = \sum_{\nu=y}^{n-1} f_{2\nu}^{(y)}\, f_{2n-2\nu}^{(1)}, \qquad n \ge y + 1.$$


Formula (4.8) states that a first passage through −1 (and hence also through +1)
at time 2n − 1 has probability $f_{2n}$, that is,


(4.13)  $$f_{2n}^{(1)} = f_{2n}, \qquad n \ge 1.$$

Equations (4.12) and (4.13) determine recursively all $f_{2n}^{(y)}$, but it is not easy to verify
that (4.11) satisfies (4.12), and it is not at all clear how the explicit formula (4.11)
could be derived from (4.12).
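
Although an analytic verification is not easy, a numerical one is. The sketch below evaluates (4.11) with exact fractions and checks that it satisfies (4.12) and (4.13) for all n < 25; this is a spot check on a finite range, not a proof, and the function names are ours.

```python
from fractions import Fraction
from math import comb

def first_passage(y, two_n):
    """f_{2n}^{(y)} of (4.11): first passage through y > 0 at time 2n - y."""
    n = two_n // 2
    if n < y:
        return Fraction(0)
    x = 2 * n - y
    return Fraction(y * comb(x, n), x * 2 ** x)

def f(two_n):
    """f_{2n} = u_{2n-2} / (2n) for n >= 1."""
    n = two_n // 2
    return Fraction(comb(2 * n - 2, n - 1), 2 * n * 4 ** (n - 1))

for n in range(1, 25):
    assert first_passage(1, 2 * n) == f(2 * n)                         # (4.13)
    for y in range(1, n):
        rhs = sum(first_passage(y, 2 * v) * first_passage(1, 2 * n - 2 * v)
                  for v in range(y, n))
        assert first_passage(y + 1, 2 * n) == rhs                      # (4.12)
```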


Formulas (4.12)-(4.13) permit a novel conclusion. We see from (4.13) that $f_{2n}^{(1)}$
is the probability that the first return to zero occurs at time 2n. Forgetting about
the preceding theorem, let us now define $f_{2n}^{(y)}$ as the probability that the yth return
to zero takes place at time 2n. The argument used in the last proof applies without
change: Splitting a path from the origin to the (y+1)st return into the initial
section leading to the yth return and the terminal section between the yth and the
(y+1)st return, we see again that (4.12) holds. Since this relation uniquely determines
all $f_{2n}^{(y)}$ we have


Theorem 3. The probability that the yth return to zero takes place at time 2n is 
given by (4.11). 


Alternative geometric proof. Consider a path leading from the origin to a first
passage through y at time 2n − y. (Figure 3 exhibits the case y = 5, 2n − y = 15.)
Construct a new path by inserting into this path y new sides each of slope −1
and having left endpoints, respectively, at the origin and the y − 1 vertices at
which a first passage through 1, 2, ..., y−1 takes place. The new path, say
$\{\sigma_1, \sigma_2, \ldots, \sigma_{2n}\}$, has length 2n. Clearly $\sigma_1 \le 0, \ldots, \sigma_{2n-1} \le 0$, $\sigma_{2n} = 0$, and
exactly y − 1 interior vertices lie on the x-axis. Conversely, each path $\{\sigma_1, \ldots, \sigma_{2n}\}$
with this property is obtained, in the manner described, from a path with first
passage through y at time 2n − y. If $f_{2n}^{(y)}$ is defined as in theorem 2, we see that
there exist exactly $2^{2n-y} f_{2n}^{(y)}$ paths $\{\sigma_1, \ldots, \sigma_{2n}\}$ such that $\sigma_i \le 0$, $\sigma_{2n} = 0$, and
exactly y − 1 interior vertices lie on the x-axis. Such a path consists of y sections
with endpoints on the x-axis, and we can produce $2^y$ different paths by changing
the signs of all $\sigma_i$ of one or more such sections. In this way we obtain all paths of
length 2n with $s_{2n} = 0$ and exactly y − 1 inner vertices on the x-axis, and their
number is therefore $2^{2n} f_{2n}^{(y)}$ as asserted.
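
Theorem 3 can be checked by enumeration for small n: a path has its yth return to zero at time 2n exactly when $s_{2n} = 0$ and the vertices $s_1, \ldots, s_{2n}$ contain exactly y zeros. The sketch below compares the resulting probabilities with (4.11); the helper names are our own.

```python
from fractions import Fraction
from itertools import accumulate, product
from math import comb

def formula(y, two_n):
    """(4.11), read as the probability of the y-th return to zero at time 2n."""
    n = two_n // 2
    x = 2 * n - y
    return Fraction(y * comb(x, n), x * 2 ** x)

def prob_yth_return(y, two_n):
    """Exhaustive probability that the y-th zero of the path occurs at time 2n."""
    hits = 0
    for steps in product((1, -1), repeat=two_n):
        s = list(accumulate(steps))
        if s[-1] == 0 and s.count(0) == y:
            hits += 1
    return Fraction(hits, 2 ** two_n)

results = {(y, n): prob_yth_return(y, 2 * n) == formula(y, 2 * n)
           for n in range(1, 7) for y in range(1, n + 1)}
print(all(results.values()))
```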


5. PROBABILITY OF LONG LEADS: THE FIRST ARC SINE LAW


We shall say that the particle spends the time from k − 1 to k on the
positive side if the kth side of its path lies above the x-axis, that is, if at
least one of the two vertices $s_{k-1}$ and $s_k$ is positive (in which case the
other is positive or zero). In the betting terminology this means that
at both the (k−1)st and the kth trial Peter’s accumulated gain was
non-negative.

The paradoxical properties of the paths mentioned in section 1 will 
be derived from the following 


Theorem 1.⁶ Let $p_{2k,2n}$ be the probability that in the time interval
from 0 to 2n the particle spends 2k time units on the positive side and
2n − 2k time units on the negative side. Then


(5.1)  $$p_{2k,2n} = u_{2k}\,u_{2n-2k}.$$

(Note that the total time spent on the positive side is necessarily even.)


Proof. The probability that the particle keeps to the positive side 
during the entire time interval from 0 to 2n is given by formula (4.6), 


6 First proved by complicated analytical methods by K. L. Chung and W. Feller
(see footnote 5 and the first edition of the present book, chapter XII, sections 5
and 6). The theorem was suggested by the work of E. Sparre Andersen (see footnote 1).




and we see that ponon = Uen as asserted. For reasons of symmetry 
we have also oon = Uen, and it remains only to prove (5.1) for 
1<k<n-—41. For that purpose we repeat the argument which led 
to (2.7). A particle that keeps for 2k > 0 time units to the positive 
side and for 2n — 2k > 0 time units to the negative side necessarily 
passes through zero. Let 2r be the moment of its first return to zero. 
Then the path belongs to one of the following two classes. 

In the first class, up to time 2r the particle stays on the positive 
side, and during the time interval from 2r to 2n it spends exactly 
$2k - 2r \ge 0$ time units on the positive side. There exist $2^{2r} f_{2r}$ paths 
of length 2r which return to the origin for the first time at 2r, and half 
of them keep to the positive side. Furthermore, by definition, there 
are $2^{2n-2r} p_{2k-2r,2n-2r}$ paths of length 2n - 2r starting at (2r, 0) and 
having exactly 2k - 2r sides above the x-axis. Thus the total number 
of paths of length 2n in the first class equals 

$\tfrac{1}{2} \cdot 2^{2r} f_{2r} \cdot 2^{2n-2r} p_{2k-2r,2n-2r} = 2^{2n-1} f_{2r} \, p_{2k-2r,2n-2r}.$


In the second class, from 0 to 2r the particle keeps to the negative 
side, and between 2r and 2n it spends 2k time units on the positive side. 
Here $2k \le 2n - 2r$ and the argument above shows that the number 
of paths in this class equals $2^{2n-1} f_{2r} \, p_{2k,2n-2r}$. 

It follows that for $1 \le k \le n - 1$ 

(5.2) $p_{2k,2n} = \tfrac{1}{2} \sum_{r=1}^{k} f_{2r} \, p_{2k-2r,2n-2r} + \tfrac{1}{2} \sum_{r=1}^{n-k} f_{2r} \, p_{2k,2n-2r}.$

Suppose now by induction that $p_{2k,2\nu} = u_{2k} u_{2\nu-2k}$ for $\nu = 1, 2, \ldots, n - 1$ 
(this relation being trivially true for $\nu = 1$). Then formula (5.2) 
reduces to 

(5.3) $p_{2k,2n} = \tfrac{1}{2} u_{2n-2k} \sum_{r=1}^{k} f_{2r} u_{2k-2r} + \tfrac{1}{2} u_{2k} \sum_{r=1}^{n-k} f_{2r} u_{2n-2k-2r}.$

In view of equation (4.10), the first sum equals $u_{2k}$ and the second 
equals $u_{2n-2k}$, and therefore (5.1) holds. 


We feel intuitively that the fraction k/n of the total time spent on the 
positive side is most likely to be close to ½. However, the opposite is 
true: The possible values close to ½ are least probable and the extreme 
values k/n = 0 and k/n = 1 have the greatest probability. This assertion 
can be verified by examining the ratio of successive terms in (5.1). 

Table 1 illustrates the paradox. In betting terminology it reveals 
the startling fact that in 2n = 20 tossings of a perfect coin with 
probability 0.3524 the less fortunate player will never be in the lead. In most 
cases (with probability 0.5379) the accumulated gain of the less 
fortunate player will be positive just once or never. By contrast, an equal 
division 10:10 of the leads has a probability of only 0.0606. 


TABLE 1 

DISTRIBUTION OF LEADS IN 20 TOSSINGS OF A COIN 

              k = 0    k = 2    k = 4    k = 6    k = 8    k = 10 
              k = 20   k = 18   k = 16   k = 14   k = 12 
p_{k,20} =    0.1762   0.0927   0.0736   0.0655   0.0617   0.0606 
P_{k,20} =    0.3524   0.5379   0.6851   0.8160   0.9394   1 

$p_{k,20} = u_k u_{20-k}$ is the probability that k sides of the path are above the axis, 
i.e., "Peter leads during exactly k out of the 20 trials." 

$P_{k,20}$ is the probability that one of the players is in the lead for at least k 
trials, the other for at most 20 - k trials. 
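
The entries of Table 1 follow directly from (5.1). The short Python sketch below recomputes them, assuming $u_{2n} = \binom{2n}{n} 2^{-2n}$ as in the earlier sections; the function names are illustrative and not part of the text.

```python
from math import comb

def u(two_n):
    """u_{2n} = C(2n, n) / 2^{2n}, the probability of a return to zero at time 2n."""
    return comb(two_n, two_n // 2) / 2 ** two_n

def lead_distribution(two_n):
    """Map 2k -> p_{2k,2n} = u_{2k} * u_{2n-2k}, following (5.1)."""
    return {k: u(k) * u(two_n - k) for k in range(0, two_n + 1, 2)}

dist = lead_distribution(20)
for k in range(0, 11, 2):
    # Cumulative row of Table 1: sum of p_{j,20} over j <= k and j >= 20 - k.
    cumulative = sum(p for j, p in dist.items() if j <= k or j >= 20 - k)
    print(f"k = {k:2d} or {20 - k:2d}:  p = {dist[k]:.4f}   P = {cumulative:.4f}")
```

The first column printed is the row $p_{k,20}$ of the table, the second the cumulative row $P_{k,20}$.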


Formula (5.1), although exact, is not very revealing, and it is 
preferable to replace it by a simpler approximation. An easy application 
of Stirling's formula II(9.1) shows that $u_{2n} (\pi n)^{1/2} \to 1$ as $n \to \infty$. 
[This is the content of problem II(12.20).] It follows that 

(5.4) $p_{2k,2n} \sim \frac{1}{\pi \{k(n-k)\}^{1/2}},$

where the ratio of the two sides tends rapidly to unity as $k \to \infty$ and 
$n - k \to \infty$. The probability that the fraction k/n of the time spent 
on the positive side lies between ½ and α (½ < α < 1) is given by 

(5.5) $\sum_{\frac12 n < k < \alpha n} p_{2k,2n} \sim \frac{1}{\pi n} \sum_{\frac12 n < k < \alpha n} \left\{ \frac{k}{n} \left( 1 - \frac{k}{n} \right) \right\}^{-1/2}.$


On the right side we recognize the Riemann sum approximating the 
integral 

(5.6) $\frac{1}{\pi} \int_{1/2}^{\alpha} \frac{dx}{\{x(1-x)\}^{1/2}} = 2\pi^{-1} \arcsin \alpha^{1/2} - \tfrac{1}{2}.$

For reasons of symmetry the probability that k/n < ½ tends to ½ as 
$n \to \infty$. Adding this probability to (5.5), we get 

Theorem 2.⁷ (The first arc sine law.) For fixed α (0 < α < 1) and 
$n \to \infty$ the probability that the fraction k/n of time spent on the positive 
side be < α tends to 

(5.7) $\frac{1}{\pi} \int_{0}^{\alpha} \frac{dx}{\{x(1-x)\}^{1/2}} = 2\pi^{-1} \arcsin \alpha^{1/2}.$
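
To see how quickly the limit in (5.7) is approached, one can compare it with the exact sum of the terms (5.1). The following Python sketch does this for a few values of n and α; the helper names are assumptions made for illustration, and the discrete sum is taken over k < αn, so small discrepancies of the order of a single term are to be expected.

```python
from math import asin, comb, pi, sqrt

def u(two_n):
    """u_{2n} = C(2n, n) / 2^{2n}."""
    return comb(two_n, two_n // 2) / 2 ** two_n

def exact_cdf(n, alpha):
    """Exact P{k/n < alpha}: sum of p_{2k,2n} = u_{2k} u_{2n-2k} over k < alpha * n."""
    return sum(u(2 * k) * u(2 * n - 2 * k) for k in range(n + 1) if k < alpha * n)

def arcsine_limit(alpha):
    """Limit (5.7): (2 / pi) * arcsin(sqrt(alpha))."""
    return 2 / pi * asin(sqrt(alpha))

for n in (10, 100, 1000):
    gaps = [round(exact_cdf(n, a) - arcsine_limit(a), 4) for a in (0.1, 0.25, 0.5, 0.75)]
    print(n, gaps)
```

The printed differences shrink as n grows, in line with the theorem.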


In practice formula (5.7) provides an excellent approximation even 
for values of n as small as 20. The integrand in (5.7) is represented 
by a U-shaped curve tending to infinity at the endpoints 0 and 1. 

Figure 4. The arc sine law. 

This shows in a striking fashion that the fraction of time spent on the 
positive side is much more likely to be close to zero or to one than to the 
"expected" or "normal" value ½. Figure 4 reveals the following: 


7 Paul Lévy (Sur certains processus stochastiques homogènes, Compositio 
Mathematica, vol. 7 (1939), pp. 283-339) found the arc sine law for certain continuous 
diffusion processes and referred to the connection with the coin-tossing game. A 
general arc sine law for the number of positive partial sums in a sequence of 
mutually independent random variables was proved by P. Erdős and M. Kac, On the 
number of positive sums of independent random variables, Bulletin of the American 
Mathematical Society, vol. 53 (1947), pp. 1011-1020. It was E. Sparre Andersen 
who discovered the combinatorial nature of the arc sine law and its validity for 
general classes of random variables. 

With probability 0.20 the particle stays for about 97.6 per cent of the 
time on the same side of the origin. In one out of 10 cases the particle 
will spend 99.4 per cent of the time on the same side. Another illustration 
is given in table 2. 

TABLE 2 

ILLUSTRATING THE ARC SINE LAW 

   p        t_p 
  0.9     153.95 days 
  0.8     126.10 days 
  0.7      99.65 days 
  0.6      75.23 days 
  0.5      53.45 days 
  0.4      34.85 days 
  0.3      19.89 days 
  0.2       8.93 days 
  0.1       2.24 days 
  0.05     13.5 hours 
  0.02      2.16 hours 
  0.01     32.4 minutes 

A coin is tossed once per second for a total of 365 days; let Z be the fraction 
of time during which the less fortunate player is in the lead. Then $t_p$ is a 
number such that the event $Z < t_p$ has probability p, approximately. 
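
The values $t_p$ can be recovered from the arc sine law. By (5.7) and symmetry, the share Z of the less fortunate player satisfies approximately $P\{Z < t\} = 4\pi^{-1} \arcsin t^{1/2}$ for $0 \le t \le \tfrac{1}{2}$, so that $t_p = \sin^2(\pi p / 4)$ as a fraction of the year. The Python sketch below evaluates this; the function name is illustrative only.

```python
from math import pi, sin

def lead_time_quantile(p):
    """Fraction t with P{Z < t} approximately p, Z being the less fortunate player's share."""
    return sin(pi * p / 4) ** 2

for p in (0.9, 0.5, 0.2, 0.05, 0.01):
    days = 365 * lead_time_quantile(p)
    print(f"p = {p:<4}  t_p = {days:8.2f} days = {24 * days:9.2f} hours")
```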


This table shows the probability p that the less fortunate player will 
be in the lead for a total of less than $t_p$ days of a full year. Using, for 
example, the significance level p = 0.05 dear to statisticians, we see 
that in one out of 20 cases the more fortunate player will be in the lead 
for more than 364 days and 10 hours. Few people will believe that a 
perfect coin will produce preposterous sequences in which no change 
of lead occurs for millions of trials in succession, and yet this is what 
a good coin will do rather regularly. 

In the next section we shall treat another aspect of the same phe- 
nomenon, and in section 7 we shall illustrate the theory by empirical 
material. 


6. THE NUMBER OF RETURNS TO THE ORIGIN 


The explanation of the arc sine law lies in the fact that frequently enormously 
many trials are required before the particle returns to the origin. Geometrically 
speaking, the path crosses the x-axis very rarely. 

We feel intuitively that if Peter and Paul toss a coin for a long time 2n, the 
number of ties (moments when the cumulative scores are equal) should be roughly 
proportional to 2n. But this is not so. Actually the number of ties increases in 
probability only as $(2n)^{1/2}$; that is, with increasing duration of the game the frequency 
of ties decreases rapidly, and the "waves" increase in length. In analyzing this 
situation we shall consider the number of returns to zero. It should be borne in 
mind that the number of times when the particle actually crosses from the positive 
side into the negative or conversely is roughly one-half the number of returns. 


Theorem 1. Let $z_{2n}^{(r)}$ be the probability that up to and including time 2n the particle 
returns to zero exactly r times. Then 

(6.1) $z_{2n}^{(r)} = \frac{1}{2^{2n-r}} \binom{2n-r}{n}, \qquad n \ge 1.$

In particular $z_{2n}^{(0)} = z_{2n}^{(1)} = u_{2n}$ and 

(6.2) $z_{2n}^{(0)} = z_{2n}^{(1)} > z_{2n}^{(2)} > z_{2n}^{(3)} > \cdots.$

In words (6.2) states that, independently of the duration 2n of the game, it is more 
likely that no return or exactly one return to zero has occurred than any other number. 


Proof. We recall that by formulas (4.4) and (4.5) there exist exactly as many 
paths of length 2ν with no return to zero as there are paths with a return to zero at 
the last step. Consider now paths of length 2n in which the rth and last return 
occurred at some time 2n - 2ν < 2n. The section of length 2ν starting at this last 
return can be chosen in as many ways as we can choose an alternative section 
starting at the same point (2n - 2ν, 0) of the x-axis and leading to (2n, 0). In other 
words: The probability that exactly r returns to zero occur before time 2n equals the 
probability that a return occurs at time 2n and that it is preceded by at least r returns. 
By theorem 3 of section 4 this means that 

(6.3) $z_{2n}^{(r)} = f_{2n}^{(r)} + f_{2n}^{(r+1)} + f_{2n}^{(r+2)} + \cdots$

with $f_{2n}^{(\nu)}$ given by (4.11). It is easily verified that 

(6.4) $f_{2n}^{(\nu)} = \frac{1}{2^{2n-\nu}} \binom{2n-\nu}{n} - \frac{1}{2^{2n-\nu-1}} \binom{2n-\nu-1}{n},$

and adding for $\nu = r, r+1, \ldots$ we get equation (6.1) as asserted. The assertion 
(6.2) being a trivial consequence, the theorem is proved. 
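
Formula (6.1) is easy to check by direct enumeration for short games. The Python sketch below counts the returns to zero over all $2^{2n}$ paths and compares the frequencies with (6.1); it is meant as an illustration for small n only, and the function names are not taken from the text.

```python
from itertools import product
from math import comb

def z_exact(n, r):
    """Formula (6.1): probability of exactly r returns to zero up to time 2n."""
    return comb(2 * n - r, n) / 2 ** (2 * n - r)

def z_brute_force(n):
    """Count returns to zero over all 2^(2n) paths; returns {r: probability}."""
    counts = {}
    for steps in product((1, -1), repeat=2 * n):
        position, returns = 0, 0
        for step in steps:
            position += step
            if position == 0:
                returns += 1
        counts[returns] = counts.get(returns, 0) + 1
    return {r: c / 2 ** (2 * n) for r, c in counts.items()}

n = 6
brute = z_brute_force(n)
for r in sorted(brute):
    print(r, brute[r], z_exact(n, r))
```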


It is again desirable to replace the exact formula (6.1) by a simpler 
approximation. For that purpose we rewrite (6.1) in the form 

(6.5) $z_{2n}^{(r)} = u_{2n} \, \frac{\left(1 - \frac{1}{n}\right) \left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{r-1}{n}\right)}{\left(1 - \frac{1}{2n}\right) \left(1 - \frac{2}{2n}\right) \cdots \left(1 - \frac{r-1}{2n}\right)}.$

As was pointed out in the proof of the arc sine law, we have $u_{2n} (\pi n)^{1/2} \to 1$ as $n \to \infty$. 
From the Taylor expansion of the logarithm, II(8.10), we see that $\log (1 - \nu/n)$ 
may be approximated by $-\nu/n$ with an error of the order of magnitude $(\nu/n)^2$. It 
follows that with an error of the magnitude $r^3/n^2$ we have the approximation 

(6.6) $\log \{ z_{2n}^{(r)} (\pi n)^{1/2} \} \approx -\frac{1}{2n} \sum_{\nu=1}^{r-1} \nu \approx -\frac{r^2}{4n}$

or 

(6.7) $z_{2n}^{(r)} \approx (\pi n)^{-1/2} e^{-r^2/4n}.$

The probability of fewer than k returns, namely $z_{2n}^{(0)} + z_{2n}^{(1)} + \cdots + z_{2n}^{(k-1)}$, is thus 
approximated by a Riemann sum to the integral over $\pi^{-1/2} e^{-x^2/4}$ extended from 0 to 
$k/n^{1/2}$, the relative (or percentage) error involved being of the order of magnitude 
$k^3/n^2$. We have thus 

Theorem 2. For each fixed α > 0 the probability that up to and including time 
2n the particle returns to the origin fewer than $\alpha (2n)^{1/2}$ times tends as $n \to \infty$ to ⁸ 

(6.8) $f(\alpha) = (2/\pi)^{1/2} \int_{0}^{\alpha} e^{-s^2/2} \, ds.$

In particular, the probability that there occur fewer than $0.6745 (2n)^{1/2}$ returns is, for 
large n, approximately ½. 


In chapter VII, section 1, the reader will find a table of the normal distribution 
function $\Phi(\alpha) = \tfrac{1}{2} \{1 + f(\alpha)\}$; from it the values f(α) may be obtained using 
$f(\alpha) = 2 \{\Phi(\alpha) - \tfrac{1}{2}\}$ for α > 0. 
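
Instead of a table, the function f of (6.8) can be evaluated through the error function, since $(2/\pi)^{1/2} \int_0^{\alpha} e^{-s^2/2} \, ds = \operatorname{erf}(\alpha / 2^{1/2})$. A minimal Python sketch, with illustrative function names, checks the constant 0.6745 and the resulting median number of returns:

```python
from math import erf, sqrt

def f(alpha):
    """f(alpha) of (6.8), written with the error function: erf(alpha / sqrt(2))."""
    return erf(alpha / sqrt(2))

def Phi(alpha):
    """Normal distribution function, Phi(alpha) = (1 + f(alpha)) / 2 for alpha >= 0."""
    return 0.5 * (1 + f(alpha))

print(f(0.6745))  # close to one half
for tosses in (10_000, 1_000_000):
    # Median number of returns is roughly 0.6745 * (2n)^(1/2), with 2n = tosses.
    print(tosses, 0.6745 * sqrt(tosses))
```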


Let a coin be tossed 10,000 times: with probability ½ there will be fewer than 68 
returns to zero, of which only one-half represent actual changes of the lead. In 
other words, with probability ½ the mean duration of a "wave" between two 
consecutive changes of lead is about 300. For 1,000,000 tossings the median number 
of returns has increased only by a factor 10, and the mean duration of a wave has 
increased to about 3000. The longer the series of trials, the rarer the returns to 
zero and the longer the waves. 

The probability that in 10,000 tossings of a coin the lead never changes is about 
0.0085, and with the same probability there will be fewer than 10 changes of lead 
in 1,000,000 tossings. 


7. AN EXPERIMENTAL ILLUSTRATION 


Figure 5 represents the result of an experiment simulating 10,000 
tosses of a coin; it is the material tabulated in example I(6.c). The 
top line contains the graph of the first 550 trials, and the next two 
lines represent the entire record of 10,000 trials on a smaller scale in 
the x-direction. The scale in the y-direction is the same on the two 
graphs. 

When looking at the graph most people feel surprised by the length 
of the waves between successive crossings of the x-axis (i.e., successive 
changes of lead). Nevertheless, the graph represents a comparatively 
mild case history and was chosen as the mildest among three available 
records. The reader is asked to look at the same graph in the reverse 
direction, that is, to take the terminal point as origin. [Analytically, 
the reversed path is given by (2.2).] Theoretically, the series as 
graphed and the reversed series are equivalent, and each represents a 
random walk. 

8 Readers acquainted with the central limit theorem are warned that the 
number of returns is not normally distributed. In (6.8) there appears a truncated 
normal distribution with mean $(2/\pi)^{1/2}$ and variance $1 - 2/\pi$. 

The reversed random walk has the following characteristics. Starting from 
the origin the "particle" stays at the 

negative side                  positive side 
first 7804 steps               next 8 steps 
next 2 steps                   next 54 steps 
next 30 steps                  next 2 steps 
next 48 steps                  next 6 steps 
next 2046 steps 

total of 9930 steps            total of 70 steps 
fraction of time: 0.9930       fraction of time: 0.0070 


This looks absurd, and yet the probability that in 10,000 tosses of a 
perfect coin the lead is on one side for more than 9930 trials and at 
the other for fewer than 70 trials is slightly greater than 0.1. In other 
words, on the average more than one record out of ten will look worse 
than the one just described. By contrast, the probability of a record 
showing a better balance of leads than that of figure 5 is smaller, 
namely about 0.072. 

The record of figure 5 contains 142 returns to the origin among which 
there are 78 actual changes of lead. The reversed series described 
above contains 14 returns of which 8 are changes of lead. Sampling 
of expert opinion has revealed that even trained statisticians feel that 
142 equalizations in 10,000 tosses of a coin is a surprisingly small num- 
ber, and 14 appears quite out of bounds. Actually the probability of 
more than 140 equalizations is about 0.157 while the probability of fewer 
than 14 equalizations is about 0.115. Thus, contrary to intuition, find- 
ing only 14 equalizations is not surprising at all; as far as the number 
of changes of lead is concerned, the reversed series stands on a par with 
the original series of figure 5. 


8. MISCELLANEOUS COMPLEMENTS 
(a) Analytical Verification of Identities 
It is easily verified that 


(8.1) $u_{2n} = (-1)^n \binom{-\frac{1}{2}}{n}, \qquad f_{2n} = (-1)^{n-1} \binom{\frac{1}{2}}{n}.$

The basic identity (4.10) can now be regarded as a special case of 
equation II(12.9) for a = ½, b = -½. The same formula shows in addition 
that $\sum_{r=0}^{n} u_{2r} u_{2n-2r} = 1$. 


Formula (2.8) may be rewritten in terms of fo, instead of Lox+2 
and reduces to the special case of II(12.9) for a = b = ½. 
Alternatively, formula (2.8) may be derived from (4.10) using the identity 
$n r^{-1} (n - r)^{-1} = r^{-1} + (n - r)^{-1}$. 


(b) The Position of the Maxima: The Second Arc Sine Law 

We shall say that the path $s_1, s_2, \ldots, s_x$ has its first maximum at 
the place k if 

(8.2) $s_k > 0, \quad s_k > s_1, \; \ldots, \; s_k > s_{k-1}, \quad s_k \ge s_{k+1}, \; \ldots, \; s_k \ge s_x.$

In particular, the first maximum is at the place 0 if $s_j \le 0$ for $1 \le j \le x$. 
By formula (4.6) the probability that a path of length x = 2n has its 
first maximum at 0 equals $u_{2n}$. It follows that also for a path of length 
x = 2n - 1 the probability of the first maximum at 0 equals $u_{2n}$. 

The event "first maximum at the last place" is the same as $s_j < s_x$ 
for j = 0, 1, ..., x - 1. For the reversed path (2.2) this means $s_1^* > 0$, 
$s_2^* > 0$, ..., $s_x^* > 0$, and the probability of this is given by (4.5), 
namely $\tfrac{1}{2} u_{2n}$ for x = 2n and also for x = 2n + 1. 

A path of length 2n with a first maximum at k consists of two sec- 
tions: The initial section has its first maximum at the last, or kth, 
place, and the second section has its first maximum at the initial, or 
zero-th, place. Conversely, any two sections with the stated 
properties may be combined to give a path with its first maximum at the kth 
place. We have thus the 

Theorem. The probability that a path of length 2n has its first 
maximum at the place ν equals 

(8.3) $\tfrac{1}{2} u_{2k} u_{2n-2k}$ if ν = 2k (k = 1, 2, ..., n) or ν = 2k + 1 (k = 0, 1, ..., n - 1), 

and $u_{2n}$ if ν = 0. 


The remarkable fact is that the probability of finding the first 
maximum at either 2k or 2k + 1 equals the probability $p_{2k,2n}$ in (5.1) that the 
particle spends 2k out of 2n time units on the positive side. It follows 
that the arc sine approximation applies and we can conclude that there 
is a strong tendency for the maxima to occur near one or the other of the 
endpoints. 

The surprising circumstance that the probability distribution 
$\{p_{2k,2n}\}$ of leads and the distribution of the position of the maxima are 
practically the same is no peculiarity of the coin-tossing game. An 
analogous theorem has been proved by E. Sparre Andersen for a large 
class of random variables, and the combinatorial basis of his proof is 
similar to the argument used above. 
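
The agreement between the two distributions can be checked by enumerating all paths of a short game. The Python sketch below does so for 2n = 8, using the definitions of "positive side" from section 5 and of "first maximum" from (8.2); the names are illustrative. As the enumeration suggests, the exact agreement concerns the interior values $1 \le k \le n - 1$, while the endpoint cases differ by the factors appearing in the theorem.

```python
from itertools import product
from math import comb

def u(two_n):
    """u_{2n} = C(2n, n) / 2^{2n}."""
    return comb(two_n, two_n // 2) / 2 ** two_n

def enumerate_paths(two_n):
    """Return (lead_prob, max_prob): the distribution of the time spent on the
    positive side and of the place of the first maximum, over all paths."""
    lead_prob, max_prob = {}, {}
    weight = 1 / 2 ** two_n
    for steps in product((1, -1), repeat=two_n):
        s = [0]
        for step in steps:
            s.append(s[-1] + step)
        # The kth side is positive if s_{k-1} or s_k is positive.
        positive_sides = sum(1 for k in range(1, two_n + 1) if max(s[k - 1], s[k]) > 0)
        # Index of the first maximum; it is 0 when the path never rises above zero.
        first_max = s.index(max(s))
        lead_prob[positive_sides] = lead_prob.get(positive_sides, 0) + weight
        max_prob[first_max] = max_prob.get(first_max, 0) + weight
    return lead_prob, max_prob

two_n = 8
lead_prob, max_prob = enumerate_paths(two_n)
for k in range(1, two_n // 2):
    both_places = max_prob[2 * k] + max_prob[2 * k + 1]
    print(k, lead_prob[2 * k], both_places, u(2 * k) * u(two_n - 2 * k))
```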


(c) A Limit Theorem for First Passages and Returns to the 
Origin ⁹ 

The estimates used in section 6 may be used to show that for fixed ν > 0 the 
probability $f_{2n}^{(\nu)}$ of (4.11) satisfies the asymptotic relation 

(8.4) $f_{2n}^{(\nu)} \sim \left( \frac{2}{\pi} \right)^{1/2} \frac{\nu}{(2n - \nu)^{3/2}} \, e^{-\nu^2/(4n - 2\nu)},$

the sign ~ indicating that the ratio of the two sides tends to unity as $n \to \infty$. 
The methods employed for the limit theorems in sections 5 and 6 now lead to the 
following conclusion: The probability that the νth return to zero (or the first passage 
through ν) takes place before time $t\nu^2$ tends, with increasing ν, to $1 - f(t^{-1/2})$ with f(α) 
defined in equation (6.8). 

It follows that with probability near ½ the νth return to zero will occur after time 
$(2.21\ldots)\nu^2$, so that the average time between consecutive returns is bound to increase 
roughly linearly with ν. This should come as a surprise to physicists accustomed 
to taking the average of ν "measurements on the same quantity" as approximation 
to the "true" value. In the present case a closer analysis reveals that in all 
likelihood one among the ν measurements will be of the same order of magnitude as the 
whole sum, namely $\nu^2$. 

9 This is theorem 3 of chapter XII, section 5, in the first edition. Advanced 
readers are advised that $1 - f(t^{-1/2})$ is the so-called positive stable distribution of 
order ½. 


CHAPTER IV* 


Combination of Events 


This chapter is concerned with events which are defined in terms 
of certain other events $A_1, A_2, \ldots, A_N$. Thus in bridge the event A, 
"at least one player has a complete suit," is the union of the four 
events $A_k$, "player number k has a complete suit" (k = 1, 2, 3, 4). 
Of the events $A_k$ one, two, or more can occur simultaneously, and, 
because of this overlap, the probability of A is not the sum of the four 
probabilities $P\{A_k\}$. Given a set of events $A_1, \ldots, A_N$, we shall 
show how to compute the probabilities that 0, 1, 2, 3, ... among them 
occur. 

The material of this chapter is covered in a monograph by M. 
Fréchet,¹ to which the reader is referred for further information. 


1. UNION OF EVENTS 


If $A_1$ and $A_2$ are two events, then $A = A_1 \cup A_2$ denotes the event 
that either $A_1$ or $A_2$ or both occur. By formula I(7.4) we have 

(1.1) $P\{A\} = P\{A_1\} + P\{A_2\} - P\{A_1 A_2\}.$

We want to generalize this formula to the case of N events $A_1, A_2, \ldots, A_N$; 
that is, we wish to compute the probability of the event that at 
least one among the $A_k$ occurs. In symbols this event is 
$A = A_1 \cup A_2 \cup \cdots \cup A_N$. For our purpose it is not sufficient to 
know the probabilities of the individual events $A_k$, but we must be 
given complete information concerning all possible overlaps. This 
means that for every pair (i, j), every triple (i, j, k), etc., we must 
know the probability of $A_i$ and $A_j$, or $A_i$, $A_j$, and $A_k$, etc., occurring 
simultaneously. For convenience of notation we shall denote these 
probabilities by the letter p with appropriate subscripts. Thus 

(1.2) $p_i = P\{A_i\}, \quad p_{i,j} = P\{A_i A_j\}, \quad p_{i,j,k} = P\{A_i A_j A_k\}, \quad \ldots.$

* The material of this chapter will not be used explicitly in the sequel. Only 
the first theorem is of considerable importance. 
1 Les probabilités associées à un système d'événements compatibles et 
dépendants, Actualités scientifiques et industrielles, nos. 859 and 942, Paris, 1940 and 1943. 


The order of the subscripts is irrelevant, but for uniqueness we shall 
always write the subscripts in increasing order; thus, we write $p_{3,7,11}$ and 
not $p_{7,3,11}$. Two subscripts are never equal. For the sum of all p's 
with r subscripts we shall write $S_r$, that is, we define 

(1.3) $S_1 = \sum p_i, \quad S_2 = \sum p_{i,j}, \quad S_3 = \sum p_{i,j,k}, \quad \ldots.$

Here $i < j < k < \cdots \le N$, so that in the sums each combination 
appears once and only once; hence $S_r$ has $\binom{N}{r}$ terms. The last sum, $S_N$, 
reduces to the single term $p_{1,2,3,\ldots,N}$, which is the probability of the 
simultaneous realization of all N events. For N = 2 we have only the 
two terms $S_1$ and $S_2$, and formula (1.1) can be written 

(1.4) $P\{A\} = S_1 - S_2.$


The generalization to an arbitrary number N of events is given in the 
following 

Theorem. The probability $P_1$ of the realization of at least one among 
the events $A_1, A_2, \ldots, A_N$ is given by 

(1.5) $P_1 = S_1 - S_2 + S_3 - S_4 + - \cdots \pm S_N.$


Proof. We prove (1.5) by the so-called method of inclusion and 
exclusion (cf. problem 26). To compute $P_1$ we should add the 
probabilities of all sample points which are contained in at least one of the $A_i$, 
but each point should be taken only once. To proceed systematically 
we first take the points which are contained in only one $A_i$, then those 
contained in exactly two events $A_i$, and so forth, and finally the points 
(if any) contained in all $A_i$. Now let E be any sample point contained 
in exactly n among our N events $A_i$. Without loss of generality we 
may number the events so that E is contained in $A_1, A_2, \ldots, A_n$ but 
not contained in $A_{n+1}, A_{n+2}, \ldots, A_N$. Then P{E} appears as a 
contribution to those $p_i, p_{i,j}, p_{i,j,k}, \ldots$ whose subscripts range from 1 to n. 
Hence P{E} appears n times as a contribution to $S_1$, and $\binom{n}{2}$ times as 
a contribution to $S_2$, etc. In all, when the right-hand side of (1.5) is 
expressed in terms of the probabilities of sample points we find P{E} 
with the factor 

(1.6) $n - \binom{n}{2} + \binom{n}{3} - + \cdots \pm \binom{n}{n}.$

To prove the theorem we have to show that this number equals 1. 
This follows at once on comparing (1.6) with the binomial expansion 
of $(1 - 1)^n$ [cf. formula II(8.7)]. The latter starts with 1, and the 
terms of (1.6) follow with reversed sign. Hence for every $n \ge 1$ the 
expression (1.6) equals 1, and this proves the theorem. 
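
Formula (1.5) can be tried out on a small sample space with equally likely points. The Python sketch below builds the sums $S_r$ from explicit sets and compares the alternating sum with the probability of the union counted directly; the sample space and the three events are hypothetical choices made only for illustration.

```python
from fractions import Fraction
from itertools import combinations

def S(events, r, sample_space_size):
    """S_r: sum of P{A_i A_j ...} over all r-combinations of the events."""
    return sum(
        (Fraction(len(set.intersection(*combo)), sample_space_size)
         for combo in combinations(events, r)),
        Fraction(0),
    )

def prob_at_least_one(events, sample_space_size):
    """Formula (1.5): P_1 = S_1 - S_2 + S_3 - ... with alternating signs."""
    return sum(
        ((-1) ** (r + 1) * S(events, r, sample_space_size)
         for r in range(1, len(events) + 1)),
        Fraction(0),
    )

# Hypothetical example: one number drawn at random from 0, 1, ..., 29;
# A_1, A_2, A_3 are the events "divisible by 2", "by 3", "by 5".
omega = range(30)
events = [{x for x in omega if x % d == 0} for d in (2, 3, 5)]
print(prob_at_least_one(events, len(omega)))
print(Fraction(len(set.union(*events)), len(omega)))
```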


Examples. (a) In a game of bridge let $A_i$ be the event "player 
number i has a complete suit." Then $p_i = 4 \big/ \binom{52}{13}$; the event that 
both player i and player j have complete suits can occur in 4·3 ways 
and has probability $p_{i,j} = 12 \big/ \binom{52}{13} \binom{39}{13}$; similarly we find 

$p_{i,j,k} = 24 \Big/ \binom{52}{13} \binom{39}{13} \binom{26}{13}.$

Finally, $p_{1,2,3,4} = p_{1,2,3}$, since whenever three players have a complete 
suit so does the fourth. The probability that some player has a 
complete suit is therefore $P_1 = 4p_1 - 6p_{1,2} + 4p_{1,2,3} - p_{1,2,3,4}$. Using 
Stirling's formula, we see that $P_1 = \tfrac{1}{4} \cdot 10^{-10}$ approximately. In this 
particular case $P_1$ is very nearly the sum of the probabilities of $A_i$, but 
this is the exception rather than the rule. 
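
With exact integer arithmetic the order of magnitude of $P_1$ can be checked without Stirling's formula. The following Python sketch evaluates the four terms above; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

hands = comb(52, 13)  # number of possible hands of one player
p1 = Fraction(4, hands)
p12 = Fraction(12, hands * comb(39, 13))
p123 = Fraction(24, hands * comb(39, 13) * comb(26, 13))
p1234 = p123  # three complete suits force the fourth

P1 = 4 * p1 - 6 * p12 + 4 * p123 - p1234
print(float(P1))             # of the order of 10^-11
print(float(P1 / (4 * p1)))  # ratio to the plain sum of the four probabilities
```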

(b) Matches (coincidences). The following problem with many 
variants and a surprising solution goes back to Montmort (1708). It has 
been generalized by Laplace and many other authors. 

Two equivalent decks of N different cards each are put into random 
order and matched against each other. If a card occupies the same 
place in both decks, we speak of a match (coincidence or rencontre). 
Matches may occur at any of the N places and at several places simul- 
taneously. This experiment may be described in more amusing forms. 
For example, the two decks may be represented by a set of N letters 
and their envelopes, and a capricious secretary may perform the random 
matching. Alternatively we may imagine the hats in a checkroom 
mixed and distributed at random to the guests. A match occurs if a 
person gets his own hat. It is instructive to venture guesses as to how 
the probability of a match depends on N: How does the probability of 
a match of hats in a diner with 8 guests compare with the 
corresponding probability at a gathering of 10,000 people? It seems surprising 
that the probability is practically independent of N and roughly ⅔. 
(For less frivolous applications cf. problems 10 and 11.) 

The probabilities of having exactly 0, 1, 2, 3, ... matches will be 
calculated in section 4. Here we shall derive only the probability $P_1$ of 
at least 1 match. For simplicity of expression let us renumber the 
cards 1, 2, ..., N in such a way that one deck appears in its natural 
order, and assume that each permutation of the second deck has 
probability 1/N!. Let $A_k$ be the event that a match occurs at the 
kth place. This means that card number k is at the kth place, and 
the remaining N - 1 cards may be in an arbitrary order. Clearly 
$p_k = (N - 1)!/N! = 1/N$. Similarly, for every combination i, j, we 
have $p_{i,j} = (N - 2)!/N! = 1/N(N - 1)$, etc. The sum $S_r$ contains 
$\binom{N}{r}$ terms, each of which equals $(N - r)!/N!$. Hence $S_r = 1/r!$, and 
from (1.5) we find the required probability to be 

(1.7) $P_1 = 1 - \frac{1}{2!} + \frac{1}{3!} - + \cdots \pm \frac{1}{N!}.$

Note that $1 - P_1$ represents the first N + 1 terms in the expansion 

(1.8) $e^{-1} = 1 - 1 + \frac{1}{2!} - \frac{1}{3!} + \frac{1}{4!} - + \cdots.$

Therefore we have with a good approximation 

(1.9) $P_1 \approx 1 - e^{-1} = 0.63212\ldots.$


The degree of approximation is shown in the following table of correct 
values of $P_1$: 

N =      3         4         5         6         7 
P_1 =    0.66667   0.62500   0.63333   0.63194   0.63214 
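
The partial sums (1.7) are quickly evaluated, and for small N they can be compared with a direct count over all permutations. The Python sketch below does both; the function names are illustrative, and the brute-force count is practical only for small decks.

```python
from fractions import Fraction
from itertools import permutations
from math import exp, factorial

def prob_at_least_one_match(N):
    """Formula (1.7): P_1 = 1 - 1/2! + 1/3! - ... with last term of index N."""
    return sum(
        (Fraction((-1) ** (k + 1), factorial(k)) for k in range(1, N + 1)),
        Fraction(0),
    )

def brute_force(N):
    """Fraction of the N! arrangements with at least one card in its own place."""
    perms = list(permutations(range(N)))
    hits = sum(1 for perm in perms if any(perm[i] == i for i in range(N)))
    return Fraction(hits, len(perms))

for N in range(3, 8):
    print(N, f"{float(prob_at_least_one_match(N)):.5f}")
print("limit 1 - 1/e =", 1 - exp(-1))
```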


2. APPLICATION TO THE CLASSICAL OCCUPANCY 
PROBLEM 


We now return to the problem of a random distribution of r balls in 
n cells, assuming that each arrangement has probability $n^{-r}$. We seek 
the probability $p_m(r, n)$ of finding exactly m cells empty.² 

Let $A_k$ be the event that cell number k is empty (k = 1, 2, ..., n). 
In this event all r balls are placed in the remaining n - 1 cells, and 
this can be done in $(n - 1)^r$ different ways. Similarly, there are 
$(n - 2)^r$ arrangements, leaving two preassigned cells empty, etc. 
Accordingly 

(2.1) $p_i = \left(1 - \frac{1}{n}\right)^r, \quad p_{i,j} = \left(1 - \frac{2}{n}\right)^r, \quad p_{i,j,k} = \left(1 - \frac{3}{n}\right)^r, \quad \ldots$

and hence for every $\nu \le n$ 

(2.2) $S_\nu = \binom{n}{\nu} \left(1 - \frac{\nu}{n}\right)^r.$

2 This probability has been derived, by an entirely different method, in problem 
II(11.8). Compare also the example in section 3. 

The probability that at least one cell is empty is given by (1.5), and 
hence the probability that all cells are occupied is $1 - S_1 + S_2 - + \cdots$ 
or 

(2.3) $p_0(r, n) = \sum_{\nu=0}^{n} (-1)^{\nu} \binom{n}{\nu} \left(1 - \frac{\nu}{n}\right)^r.$

Consider now a distribution in which exactly m cells are empty. These 
m cells can be chosen in $\binom{n}{m}$ ways. The r balls are distributed among 
the remaining n - m cells so that each of these cells is occupied; the 
number of such distributions is $(n - m)^r p_0(r, n - m)$. Dividing by $n^r$ 
we find for the probability that exactly m cells remain empty 

(2.4) $p_m(r, n) = \binom{n}{m} \left(1 - \frac{m}{n}\right)^r p_0(r, n - m) = \binom{n}{m} \sum_{\nu=0}^{n-m} (-1)^{\nu} \binom{n-m}{\nu} \left(1 - \frac{m + \nu}{n}\right)^r.$

We have already used the model of r random digits to illustrate the 
random distribution of r things in n = 10 cells. Empty cells 
correspond in this case to missing digits: if m cells are empty, 10 - m 
different digits appear in the given sequence. Table 1 provides a 
numerical illustration. 


TABLE 1 

PROBABILITIES $p_m(r, 10)$ ACCORDING TO (2.4) 

 m      r = 10        r = 18 
 0     0.000 363     0.134 673 
 1      .016 330      .385 289 
 2      .136 080      .342 987 
 3      .355 622      .119 425 
 4      .345 144      .016 736 
 5      .128 596      .000 876 
 6      .017 189      .000 014 
 7      .000 672      .000 000 
 8      .000 005      .000 000 
 9      .000 000      .000 000 

$p_m(r, 10)$ is the probability that exactly m of the digits 0, 1, ..., 9 will not 
appear in a sequence of r random digits. 

It is clear that a direct numerical evaluation of (2.4) is limited to 
the case of relatively small n and r. On the other hand, the occupancy 
problem is of particular interest when n is large. If 10,000 balls are 
distributed in 1000 cells, is there any chance of finding an empty cell? 
In a group of 2000 people, is there any chance of finding a day in the 
year which is not a birthday? Fortunately, questions of this kind can 
be answered by means of a remarkably simple approximation with an 
error which tends to zero as $n \to \infty$. This approximation and the 
argument leading to it are typical of many limit theorems in probability. 
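
For small n and r the exact formula (2.4) is readily evaluated by machine. The Python sketch below uses exact rational arithmetic to avoid cancellation in the alternating sum and prints the two columns r = 10 and r = 18 for n = 10; the function name is an illustrative choice.

```python
from fractions import Fraction
from math import comb, factorial

def p_empty(m, r, n):
    """Formula (2.4): probability that exactly m of n cells stay empty with r balls."""
    alternating = sum(
        ((-1) ** v * comb(n - m, v) * Fraction(n - m - v, n) ** r
         for v in range(n - m + 1)),
        Fraction(0),
    )
    return comb(n, m) * alternating

for m in range(10):
    print(m, f"{float(p_empty(m, 10, 10)):.6f}", f"{float(p_empty(m, 18, 10)):.6f}")
```

For m = 0 and r = n = 10 the result reduces to $10!/10^{10}$, which gives a simple independent check.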

Our purpose, then, is to discuss the limiting form of the formula (2.4) 
as $n \to \infty$ and $r \to \infty$. The relation between r and n is, in principle, 
arbitrary. However, the ratio r/n represents the average number of 
things per cell. If it is excessively large, then we cannot expect any 
empty cells; in this case $p_0(r, n)$ is near unity and all $p_m(r, n)$ with 
$m \ge 1$ are small. On the other hand, if r/n tends to zero, then 
practically all cells must be empty, and in this case $p_m(r, n) \to 0$ for every 
fixed m. Therefore only the intermediate case is of real interest. 

We begin by estimating the quantity $S_\nu$ of formula (2.2). Since 
$(n - \nu)^{\nu} < (n)_{\nu} < n^{\nu}$, we have clearly 

(2.5) $n^{\nu} \left(1 - \frac{\nu}{n}\right)^{\nu + r} < \nu! \, S_\nu < n^{\nu} \left(1 - \frac{\nu}{n}\right)^{r}.$

Using the double inequality II(8.12) with $t = \nu/n$, we get 

(2.6) $\{ n e^{-(\nu + r)/(n - \nu)} \}^{\nu} < \nu! \, S_\nu < \{ n e^{-r/n} \}^{\nu}.$

Now put for abbreviation 

(2.7) $n e^{-r/n} = \lambda$

and suppose that r and n increase in such a way that λ remains bounded. 
Then, for each fixed ν, the ratio of the extreme members in (2.6) tends 
to unity, and we conclude that 

(2.8) $S_\nu \sim \frac{\lambda^{\nu}}{\nu!} \quad \text{and} \quad \frac{\lambda^{\nu}}{\nu!} - S_\nu \to 0.$

It follows that 

(2.9) $p_0(r, n) - \sum_{\nu=0}^{\infty} (-1)^{\nu} \frac{\lambda^{\nu}}{\nu!} \to 0,$

or $p_0(r, n) - e^{-\lambda} \to 0$. Now the factor of $p_0(r, n - m)$ in (2.4) may 
be rewritten as $S_m$, and we have therefore for each fixed m 

(2.10) $p_m(r, n) - e^{-\lambda} \frac{\lambda^m}{m!} \to 0.$

This completes the proof of the 

Theorem.³ If n and r tend to infinity so that $\lambda = n e^{-r/n}$ remains 
bounded, then (2.10) holds for each fixed m. 

The approximating expressions 

(2.11) $p(m; \lambda) = e^{-\lambda} \frac{\lambda^m}{m!}$

define the so-called Poisson distribution, which is of great importance 
and describes a variety of phenomena; it will be studied in chapter VI. 
In practice we may use p(m; λ) as an approximation whenever n is 
great. For moderate values of n an estimate of the error is required, 
but we shall not enter into it. 

Examples. (a) Table 2 gives the approximate probabilities of find- 
ing m cells empty when the number of cells is 1000 and the number of 
balls varies from 5000 to 9000. For r = 5000 the median value of the 
number of empty cells is six: seven or more empty cells are about as 
probable as six or fewer. Even with 9000 balls in 1000 cells we have 
about one chance in nine to find an empty cell. 

(b) In birthday statistics [example II(3.d)] n = 365, and r is the 
number of people. For r = 1900 we find λ = 2, approximately. In a 
village of 1900 people the probabilities $P_{[m]}$ of finding m days of the year 
which are not birthdays are approximately as follows: 

$P_{[0]} = 0.135$, $P_{[1]} = 0.271$, $P_{[2]} = 0.271$, $P_{[3]} = 0.180$, 
$P_{[4]} = 0.090$, $P_{[5]} = 0.036$, $P_{[6]} = 0.012$, $P_{[7]} = 0.003$. 
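
These values are simply the Poisson terms (2.11) with λ = 2. The Python sketch below computes λ from (2.7) for n = 365, r = 1900 and then lists p(m; 2); the function name is illustrative, and rounding λ to 2 follows the text.

```python
from math import exp, factorial

def poisson(m, lam):
    """Poisson term p(m; lambda) = e^(-lambda) * lambda^m / m!, as in (2.11)."""
    return exp(-lam) * lam ** m / factorial(m)

n, r = 365, 1900
lam = n * exp(-r / n)  # definition (2.7)
print("lambda =", lam)
for m in range(8):
    print(m, round(poisson(m, 2), 3))
```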


The probability of finding exactly m cells each containing exactly k 
balls can be derived in the same way. As von Mises has shown, this 
probability can again be approximated by the Poisson expression (2.11), 
only this time λ must be defined by 

(2.12) $\lambda = n e^{-r/n} \, \frac{1}{k!} \left( \frac{r}{n} \right)^{k}.$

3 Due (with a different proof) to R. von Mises, Über Aufteilungs- und 
Besetzungswahrscheinlichkeiten, Revue de la Faculté des Sciences de l'Université d'Istanbul, 
N.S., vol. 4 (1939), pp. 145-163. 


3. THE REALIZATION OF m AMONG N EVENTS 

The theorem of section 1 can be strengthened as follows. 

Theorem. For any integer m with $1 \le m \le N$ the probability $P_{[m]}$ 
that exactly m among the N events $A_1, \ldots, A_N$ occur simultaneously is 
given by 

(3.1) $P_{[m]} = S_m - \binom{m+1}{m} S_{m+1} + \binom{m+2}{m} S_{m+2} - + \cdots \pm \binom{N}{m} S_N.$

Note: According to (1.5), the probability $P_{[0]}$ that none among the 
$A_k$ occurs is 

(3.2) $P_{[0]} = 1 - P_1 = 1 - S_1 + S_2 - S_3 + \cdots \mp S_N.$

This shows that (3.1) gives the correct value also for m = 0 provided 
we put $S_0 = 1$. 

Proof. We proceed as in the proof of (1.5). Let E be an arbitrary 
sample point, and suppose that it is contained in exactly n among the 
N events $A_k$. Then P{E} appears as a contribution to $P_{[m]}$ only if 
n = m. To investigate how P{E} contributes to the right side of (3.1), 
note that P{E} appears in the sums $S_1, S_2, \ldots, S_n$ but not in $S_{n+1}, 
\ldots, S_N$. It follows that P{E} does not contribute to the right side in 
(3.1) if n < m. If n = m, then P{E} appears in one and only one 
term of $S_m$. To complete the proof of the theorem it remains to show 
that for n > m the contributions of P{E} to the terms $S_m, S_{m+1}, \ldots, S_n$ 
on the right in (3.1) cancel. Now out of the n events containing E we 
can form $\binom{n}{k}$ k-tuplets; hence P{E} appears in $S_k$ with the factor $\binom{n}{k}$. 
For n > m the total contribution of P{E} to the right side in (3.1) is 
therefore 

(3.3) $\binom{n}{m} - \binom{m+1}{m} \binom{n}{m+1} + \binom{m+2}{m} \binom{n}{m+2} - + \cdots.$

However, $\binom{m+\nu}{m} \binom{n}{m+\nu} = \binom{n}{m} \binom{n-m}{\nu}$, and hence (3.3) 
reduces to 

(3.4) $\binom{n}{m} \left\{ \binom{n-m}{0} - \binom{n-m}{1} + - \cdots \pm \binom{n-m}{n-m} \right\}.$

Within the braces we have the binomial expansion of $(1 - 1)^{n-m}$ so 
that (3.3) vanishes, as asserted. 


Example. The reader is asked to verify that a substitution from 
formula (2.2) into (3.1) leads directly to formula (2.4). 
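The theorem can also be checked numerically on a small sample space. The sketch below is an illustration only: it assumes a sample space of equally likely points and builds five arbitrary events at random; the helper names `S`, `exactly_m_formula` and `exactly_m_direct` are introduced here and do not come from the text. It computes the sums $S_k$ from all k-fold intersections and compares the right side of (3.1) with a direct count.

```python
from fractions import Fraction
from itertools import combinations
from math import comb
import random

def S(events, omega_size, k):
    """S_k: sum of the probabilities of all k-fold intersections (S_0 = 1)."""
    if k == 0:
        return Fraction(1)
    total = Fraction(0)
    for chosen in combinations(events, k):
        total += Fraction(len(set.intersection(*chosen)), omega_size)
    return total

def exactly_m_formula(events, omega_size, m):
    """Right side of (3.1), written as a sum over k = m, ..., N."""
    N = len(events)
    return sum((-1) ** (k - m) * comb(k, m) * S(events, omega_size, k)
               for k in range(m, N + 1))

def exactly_m_direct(events, omega_size, m):
    """Count the sample points contained in exactly m of the events."""
    hits = sum(1 for pt in range(omega_size)
               if sum(pt in A for A in events) == m)
    return Fraction(hits, omega_size)

# Hypothetical test case: 30 equally likely points, 5 random events.
random.seed(1)
OMEGA = 30
events = [set(random.sample(range(OMEGA), random.randint(5, 20)))
          for _ in range(5)]
for m in range(len(events) + 1):
    print(m, exactly_m_formula(events, OMEGA, m), exactly_m_direct(events, OMEGA, m))
```

Exact rational arithmetic is used so that the two columns agree exactly rather than up to rounding.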
4. APPLICATION TO MATCHING AND GUESSING 


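In example (1.b) we considered the matching of two decks of cards 
and found that $S_k = 1/k!$. Substituting into (3.1), we find the following 
result. 


In a random matching of two equivalent decks of N distinct cards the 
probability $P_{[m]}$ of having exactly m matches is given by 


(4.1) 
$P_{[0]} = 1 - 1 + \dfrac{1}{2!} - \dfrac{1}{3!} + - \cdots \pm \dfrac{1}{(N-2)!} \mp \dfrac{1}{(N-1)!} \pm \dfrac{1}{N!}$ 

$P_{[1]} = 1 - 1 + \dfrac{1}{2!} - \dfrac{1}{3!} + - \cdots \pm \dfrac{1}{(N-2)!} \mp \dfrac{1}{(N-1)!}$ 

$P_{[2]} = \dfrac{1}{2!}\left\{1 - 1 + \dfrac{1}{2!} - \dfrac{1}{3!} + - \cdots \pm \dfrac{1}{(N-2)!}\right\}$ 

$P_{[3]} = \dfrac{1}{3!}\left\{1 - 1 + \dfrac{1}{2!} - \dfrac{1}{3!} + - \cdots \mp \dfrac{1}{(N-3)!}\right\}$ 

$\cdots$ 

$P_{[N-2]} = \dfrac{1}{(N-2)!}\left\{1 - 1 + \dfrac{1}{2!}\right\}$ 

$P_{[N-1]} = \dfrac{1}{(N-1)!}\{1 - 1\} = 0$ 

$P_{[N]} = \dfrac{1}{N!}$. 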


The last relation is obvious. The vanishing of $P_{[N-1]}$ expresses the 
impossibility of having N - 1 matches without having all N cards in 
the same order. 
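For small decks the matching probabilities can be confirmed by listing all N! arrangements. The following sketch is not part of the original argument; it assumes the compact form $P_{[m]} = \frac{1}{m!}\sum_{k=0}^{N-m} (-1)^k/k!$, which is what substituting $S_k = 1/k!$ into (3.1) gives, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def matches_exact(N, m):
    """P_[m] for N cards: (1/m!) * sum_{k=0}^{N-m} (-1)^k / k!."""
    partial = sum(Fraction((-1) ** k, factorial(k)) for k in range(N - m + 1))
    return Fraction(1, factorial(m)) * partial

def matches_bruteforce(N, m):
    """Fraction of the N! arrangements having exactly m cards in place."""
    hits = sum(1 for perm in permutations(range(N))
               if sum(perm[i] == i for i in range(N)) == m)
    return Fraction(hits, factorial(N))

for m in range(6):
    print(m, matches_exact(5, m), matches_bruteforce(5, m))
```

The enumeration grows like N!, so it is practical only for small N; the closed form has no such limitation.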

The braces on the right in (4.1) contain the initial terms of the expansion 
of $e^{-1}$. For large N we have therefore approximately 


(4.2)  $P_{[m]} \approx \dfrac{e^{-1}}{m!}$. 
TABLE 3 


PROBABILITIES OF m CORRECT GUESSES IN CALLING A DECK OF N 
DISTINCT CARDS 


     |     N = 3     |     N = 4     |     N = 5     |     N = 6     |      N = 10       | 
  m  |  P[m]   b_m   |  P[m]   b_m   |  P[m]   b_m   |  P[m]   b_m   |  P[m]     b_m     |   p_m 
 ----+---------------+---------------+---------------+---------------+-------------------+---------- 
  0  | 0.333  0.296  | 0.375  0.316  | 0.367  0.328  | 0.368  0.335  | 0.36788  0.34868  | 0.367879 
  1  |  .500   .444  |  .333   .422  |  .375   .410  |  .367   .402  |  .36788   .38742  |  .367879 
  2  |  ...    .222  |  .250   .211  |  .167   .205  |  .187   .201  |  .18394   .19371  |  .183940 
  3  |  .167   .037  |  ...    .047  |  .083   .051  |  .056   .053  |  .06131   .05740  |  .061313 
  4  |               |  .042   .004  |  ...    .006  |  .021   .008  |  .01534   .01116  |  .015328 
  5  |               |               |  .008   .000  |  ...    .001  |  .00306   .00149  |  .003066 
  6  |               |               |               |  .001   .000  |  .00052   .00014  |  .000511 
  7  |               |               |               |               |  .00007   .00001  |  .000073 
  8  |               |               |               |               |  .00001   .....   |  .000009 
  9  |               |               |               |               |  .....    .....   |  .000001 
 10  |               |               |               |               |  .....    .....   |  .000000 


The $P_{[m]}$ are given by (4.1), the $b_m$ by (4.4). The last column gives the Poisson 
limits (4.3). 


In table 3 the columns headed $P_{[m]}$ give the exact values of $P_{[m]}$ for 
N = 3, 4, 5, 6, 10. The last column gives the limiting values 


(4.3)  $p_m = \dfrac{e^{-1}}{m!}$. 


The approximation of $p_m$ to $P_{[m]}$ is rather good even for moderate 
values of N. 

For the numbers $p_k$ defined by (4.3) we have 


$\sum p_k = e^{-1}\left(1 + 1 + \dfrac{1}{2!} + \dfrac{1}{3!} + \cdots\right) = e^{-1} e = 1$. 


Accordingly, the $p_k$ may be interpreted 
as probabilities. Note that (4.3) represents the special case $\lambda = 1$ of the 
Poisson distribution (2.11). 
Formulas (4.1) are useful in testing guessing abilities. In wine tast- 
ing, psychic experiments, etc., the subject is asked to call an unknown 
order of N things, say, cards. Any actual insight on the part of the 
subject will appear as a departure from randomness. To judge the 
amount of insight we must appraise the probability of turns of good 
luck. Now chance guesses can be made according to several systems 
among which we mention three extreme possibilities. (1) The subject 
sticks to one card and keeps calling it. With this system he is sure to 
have one, and only one, correct guess in each series; chance fluctuations 
are eliminated. (2) The subject calls each card once so that each series 
of N guesses corresponds to a rearrangement of the deck. If this system 
is applied without insight, formulas (4.1) should apply. (3) A 
third possibility is that N guesses are made absolutely independently 
of each other. There are $N^N$ possible arrangements. It is true that 
every person has fixed mental habits and is prone to call certain patterns 
more frequently than others, but in first approximation we may 
assume all $N^N$ arrangements to be equally probable. Since m correct 
and N - m incorrect guesses can be arranged in $\binom{N}{m}(N-1)^{N-m}$ 
different ways, the probability of exactly m correct guesses is now 


(4.4)  $b_m = \binom{N}{m} \dfrac{(N-1)^{N-m}}{N^N}$. 


[This is a special case of the binomial distribution and has been derived 
in example II(4.c).] 

Table 3 gives a comparison of the probabilities of success when 
guesses are made in accordance with system (2) or (3). To judge the 
merits of the two methods we require the theory of mean values and 
probable fluctuations. It turns out that the average number of correct 
chance guesses is one under all systems; the chance fluctuations are 
somewhat larger under system (2) than (3). A glance at table 3 will 
show that in practice the differences will not be excessive. 
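The comparison in table 3 can be reproduced with a few lines of Python. The sketch below is illustrative: it evaluates (4.1) in the compact form $\frac{1}{m!}\sum_{k=0}^{N-m}(-1)^k/k!$, the binomial probabilities (4.4) and the Poisson limits (4.3), and it computes the mean and the variance of the number of correct guesses directly from each distribution. The variance is used here as a measure of the chance fluctuations, which is an assumption of this sketch since the theory of mean values is developed only later; all function names are hypothetical.

```python
from math import comb, exp, factorial

def p_match(N, m):
    """System (2): exact matching probability P_[m] from (4.1)."""
    return sum((-1) ** k / factorial(k) for k in range(N - m + 1)) / factorial(m)

def b_binom(N, m):
    """System (3): independent guesses, formula (4.4)."""
    return comb(N, m) * (N - 1) ** (N - m) / N ** N

def p_poisson(m):
    """Limiting value (4.3)."""
    return exp(-1) / factorial(m)

def mean_var(dist):
    """Mean and variance of a distribution given as a list indexed by m."""
    mean = sum(m * p for m, p in enumerate(dist))
    var = sum((m - mean) ** 2 * p for m, p in enumerate(dist))
    return mean, var

N = 10
mean_m, var_m = mean_var([p_match(N, m) for m in range(N + 1)])
mean_b, var_b = mean_var([b_binom(N, m) for m in range(N + 1)])
print(f"system (2): mean {mean_m:.4f}, variance {var_m:.4f}")
print(f"system (3): mean {mean_b:.4f}, variance {var_b:.4f}")
```

For N = 10 both means equal one, while the variance is 1 under system (2) and 0.9 under system (3), in line with the remark that the fluctuations are somewhat larger under system (2).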


5. MISCELLANY 


(a) The Realization of at Least m Events 


With the notations of section 3 the probability $P_m$ that m or more of 
the events $A_1, \ldots, A_N$ occur simultaneously is given by 


(5.1)  $P_m = P_{[m]} + P_{[m+1]} + \cdots + P_{[N]}$. 


To find a formula for $P_m$ in terms of $S_k$ it is simplest to proceed by 
induction, starting with formula (1.5) and using the recurrence relation 
$P_{m+1} = P_m - P_{[m]}$. We get for $m \ge 1$ 


(5.2)  $P_m = S_m - \binom{m}{m-1} S_{m+1} + \binom{m+1}{m-1} S_{m+2} - + \cdots \pm \binom{N-1}{m-1} S_N$. 
It is also possible to derive (5.2) directly, using the argument which 
led to (3.1). 


(b) Further Identities 


The coefficients $S_\nu$ can be expressed in terms of either $P_{[k]}$ or $P_k$ as 
follows 


(5.3)  $S_\nu = \sum_{k=\nu}^{N} \binom{k}{\nu} P_{[k]}$ 


and 


(5.4)  $S_\nu = \sum_{k=\nu}^{N} \binom{k-1}{\nu-1} P_k$. 


Indication of proof. For given values of $P_{[m]}$ the equations (3.1) 
may be taken as linear equations in the unknowns $S_\nu$, and we have to 
prove that (5.3) represents the unique solution. If (5.3) is introduced 
into the expression (3.1) for $P_{[m]}$, the coefficient of $P_{[k]}$ ($m \le k \le N$) 
to the right is found to be 


(5.5)  $\sum_{\nu=m}^{k} (-1)^{\nu-m} \binom{\nu}{m}\binom{k}{\nu} = \binom{k}{m} \sum_{\nu=m}^{k} (-1)^{\nu-m} \binom{k-m}{\nu-m}$. 


If k = m this expression reduces to 1. If k > m the sum is the binomial 
expansion of $(1 - 1)^{k-m}$ and therefore vanishes. Hence the substitution 
(5.3) reduces (3.1) to the identity $P_{[m]} = P_{[m]}$. The uniqueness 
of the solution of (3.1) follows from the fact that each equation introduces 
only one new unknown, so that the $S_\nu$ can be computed recursively. 
The truth of (5.4) can be proved in a similar way. 


(c) Bonferroni’s Inequalities 


A string of inequalities both for $P_{[m]}$ and for $P_m$ can be obtained in 
the following way. If in either (3.1) or (5.2) only the terms involving 
$S_m, S_{m+1}, \ldots, S_{m+r-1}$ are retained while the terms involving $S_{m+r}, 
S_{m+r+1}, \ldots, S_N$ are dropped, then the error (i.e., true value minus 
approximation) has the sign of the first omitted term [namely, $(-1)^r$] and is 
smaller in absolute value. Thus, for r = 1 and r = 2: 


(5.6)  $S_m - (m+1) S_{m+1} \le P_{[m]} \le S_m$ 

and 


(5.7)  $S_m - m S_{m+1} \le P_m \le S_m$. 
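The inequalities (5.6) and (5.7) are easy to test on the matching problem, where $S_k = 1/k!$ by example (1.b). The sketch below is only an illustration under that assumption, with a deck of N = 8 cards chosen arbitrarily; it evaluates (3.1) and (5.2) as finite sums and checks the bounds for each m. The function names are introduced here and are not from the text.

```python
from fractions import Fraction
from math import comb, factorial

def S(k):
    """S_k for the matching problem of example (1.b)."""
    return Fraction(1, factorial(k))

def exactly(N, m):
    """P_[m] from (3.1)."""
    return sum((-1) ** (k - m) * comb(k, m) * S(k) for k in range(m, N + 1))

def at_least(N, m):
    """P_m from (5.2), valid for m >= 1."""
    return sum((-1) ** (k - m) * comb(k - 1, m - 1) * S(k) for k in range(m, N + 1))

N = 8  # arbitrary illustrative deck size
for m in range(1, N):
    lower_exact = S(m) - (m + 1) * S(m + 1)
    lower_least = S(m) - m * S(m + 1)
    print(m,
          lower_exact <= exactly(N, m) <= S(m),    # (5.6)
          lower_least <= at_least(N, m) <= S(m))   # (5.7)
```

The loop stops at m = N - 1 because $S_{N+1}$ does not occur in a problem with only N events.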
Indication of proof. To prove the statement for (3.1) it must be 
shown that 


(5.8)  $\sum_{\nu=t}^{N} (-1)^{\nu-t} \binom{\nu}{m} S_\nu \ge 0$ 


for every t. Now use (5.3) to write the left side as a linear combination 
of the $P_{[k]}$. For $t \le k \le N$ the coefficient of $P_{[k]}$ equals 


$\sum_{\nu=t}^{k} (-1)^{\nu-t} \binom{\nu}{m}\binom{k}{\nu} = \binom{k}{m} \sum_{\nu=t}^{k} (-1)^{\nu-t} \binom{k-m}{\nu-m}$. 


The last sum equals $\binom{k-m-1}{t-m-1}$ and is therefore positive [problem 
II(12.13)]. For further inequalities the reader is referred to Fréchet's 
monograph cited at the beginning of the chapter. 


6. PROBLEMS FOR SOLUTION 


Note: Assume in each case that all possible arrangements have the same probability. 


1. Ten pairs of shoes are in a closet. Four shoes are selected at random. 
Find the probability that there will be at least one pair among the four shoes 
selected. 

2. Five dice are thrown. Find the probability that at least three of them 
show the same face. (Verify by the methods of chapter II, section 5.) 

3. Find the probability that in five tossings a coin falls heads at least three 
times in succession. 

4. Solve problem 3 for a head-run of at least length five in ten tossings. 

5. Solve problems 3 and 4 for ace runs when a die is used instead of a coin. 

6. Two dice are thrown r times. Find the probability $p_r$ that each of the 
six combinations (1, 1), ..., (6, 6) appears at least once. 

7. Quadruples in a bridge hand. By a quadruple we shall understand four 
cards of the same face value, so that a bridge hand of thirteen cards may con- 
tain 0, 1, 2, or 3 quadruples. Calculate the corresponding probabilities. 

8. Sampling with replacement. A sample of size r is taken from a population 
of n people. Find the probability $u_r$ that N given people will all be 
included in the sample. [This is problem II(11.12).] 

9. Sampling without replacement. Answer problem 8 for this case and show 
that 8 holds with u, > p%. (This is problem II(11.3), but the present method 
leads to an entirely different formula.) 

10. In the general expansion of a determinant of order N the number of 
terms containing one or more diagonal elements is $N!\,P_1$ with $P_1$ defined by (1.7). 

11. The number of ways in which 8 rooks can be placed on a chessboard 
so that none can take another and that none stands on the white diagonal is 
$8!\,(1 - P_1)$, where $P_1$ is defined by (1.7) with N = 8. 
12. A sampling (coupon collector's) problem. A pack of cards consists of s 
identical series, each containing n cards numbered 1, 2, ..., n. A random 
sample of $r \ge n$ cards is drawn from the pack without replacement. Calculate 
the probability $u_r$ that each number is represented in the sample. (Applied 
to a deck of bridge cards we get for s = 4, n = 13 the probability that a hand 
of r cards contains all 13 values; and for s = 13, n = 4 we get the probability 
that all four suits are represented.) 

13. Continuation. Show that as $s \to \infty$ one has $u_r \to p_0(r, n)$ where the 
latter expression is defined in (2.3). This means that in the limit our sampling 
becomes random sampling with replacement from the population of the numbers 
1, 2, ..., n. 


14. Continuation. From the result of problem 12 conclude that 

$\sum_{k=0}^{n} (-1)^k \binom{n}{k} (ns - ks)_r = 0$ 

if r < n and for r = n 

$\sum_{k=0}^{n} (-1)^k \binom{n}{k} (ns - ks)_r = s^n n!$. 


Verify this by evaluating the rth derivative, at x = 0, of 


$\dfrac{1}{(1-x)^{ns-r+1}} \left\{1 - (1-x)^s\right\}^n$. 


15. In the sampling problem 12 find the probability that it will take exactly 
r drawings to get a sample containing all numbers. Pass to the limit as $s \to \infty$. 


16. A cell contains N chromosomes, between any two of which an interchange 
of parts may occur. If r interchanges occur (which can happen in $\binom{N}{2}^r$ 
distinct ways), find the probability that exactly m chromosomes will be 
involved.* 


17. Find the probability that exactly k suits will be missing in a poker hand. 


18. Find the probability that a hand of thirteen bridge cards contains the 
ace-king pairs of exactly k suits. 


19. Multiple matching. Two similar decks of N distinct cards each are 
matched simultaneously against a similar target deck. Find the probability 
$u_m$ of having exactly m double matches. Show that $u_0 \to 1$ as $N \to \infty$ (which 
implies that $u_m \to 0$ for $m \ge 1$). 

20. Multiple matching. The procedure of the preceding problem is modified 
as follows. Out of the 2N cards N are chosen at random, and only these N 
are matched against the target deck. Find the probability of no match. Prove 
that it tends to 1/e as $N \to \infty$. 


21. Multiple matching. Answer problem 20 if r decks are used instead of 
two. 


*For N = 6 see D. G. Catcheside, D. E. Lea, and J. M. Thoday, Types of 
chromosome structural change introduced by the irradiation of tradescantia micro- 
spores, Journal of Genetics, vol. 47 (1945-46), pp. 118-149. 
22. In the classical occupancy problem, the probability $P_{[m]}(k)$ of finding 
exactly m cells occupied by exactly k things is 


$P_{[m]}(k) = \dfrac{(-1)^m \, n! \, r!}{m! \, n^r} \sum_j (-1)^j \dfrac{(n-j)^{r-jk}}{(j-m)! \, (n-j)! \, (r-jk)! \, (k!)^j}$, 


the summation extending over those $j \ge m$ for which $j \le n$ and $kj \le r$. 
23. Prove the last statement of section 2 for the case k = 1. 


24. Using (3.1), derive the probability of finding exactly m empty cells in 
the case of Bose-Einstein statistics. 


25. Verify that the formula obtained in 24 checks with formula II(11.14). 
26. Prove formula (1.5) by induction on N. 


CHAPTER V 


Conditional Probability. 
Stochastic Independence 


1. CONDITIONAL PROBABILITY 


The notion of conditional probability is a basic tool of probability 
theory, and it is unfortunate that its great simplicity is somewhat ob- 
scured by a singularly clumsy terminology. The following considera- 
tions lead in a natural way to the formal definition. 


Preparatory Examples 


Suppose a population of N people includes $N_A$ colorblind people and 
$N_H$ females. Let the events that a person chosen at random is colorblind 
and a female be A and H, respectively. Then (cf. the definition 
of random choice, chapter II, section 2) 


(1.1)  $P\{A\} = \dfrac{N_A}{N}, \qquad P\{H\} = \dfrac{N_H}{N}$. 


Instead of the entire population, we may investigate the female subpopulation 
and require the probability that a female chosen at random 
be colorblind. This probability is $N_{AH}/N_H$, where $N_{AH}$ is the number 
of colorblind females. We have here no new notion, but we need a new 
notation to designate which particular subpopulation is under investigation. 
The most widely adopted symbol is P{A|H}; it may be read 
“the probability of the event A (colorblindness), assuming the event H 
(that the person chosen is female).” In symbols: 


(1.2)  $P\{A \mid H\} = \dfrac{N_{AH}}{N_H} = \dfrac{P\{AH\}}{P\{H\}}$. 

Obviously every subpopulation may be considered as a population 
in its own right; we speak of a subpopulation merely for convenience 
of language to indicate that we have a larger population in the back of 
our minds. An insurance company may be interested in the frequency 
of damages of a fixed amount caused by lightning (event A). Presuma- 
bly this company has several categories of insured objects such as in- 
dustrial, urban, rural, etc. Studying separately the damages to indus- 
trial objects means to study the event A only in conjunction with the 
event H, “Damage is to an industrial object.” Formula (1.2) again 
applies in an obvious manner. Note, however, that for an insurance 
company specializing in industrial objects the category H coincides 
with the whole sample space, and P{A|H} reduces to P{A}. 

Finally consider the bridge player North. Once the cards are dealt, 
he knows his hand and is interested only in the distribution of the re- 
maining 39 cards. It is legitimate to introduce the aggregate of all 
possible distributions of these 39 cards as a new sample space, but it 
is obviously more convenient to consider them in conjunction with the 
13 cards in North’s hand (event H) and to speak of the probability of 
an event A (say South’s having two aces) assuming the event H. For- 
mula (1.2) again applies. 


By analogy with (1.2) we now introduce the formal 


Definition. Let H be an event with positive probability. For an arbitrary 
event A we shall write 


(1.3)  $P\{A \mid H\} = \dfrac{P\{AH\}}{P\{H\}}$. 


The quantity so defined will be called the conditional probability of A on 
the hypothesis H (or for given H). When all sample points have equal 
probabilities, P{A|H} is the ratio $N_{AH}/N_H$ of the number of sample 
points common to A and H, to the number of points in H. 


Conditional probabilities remain undefined when the hypothesis has 
zero probability. This is of no consequence in the case of discrete 
sample spaces but is important in the general theory. 

Though the symbol P{A|H} itself is practical, its phrasing in words 
is so unwieldy that in practice less formal descriptions are used. Thus 
in our introductory example we referred to the probability of a female's 
being colorblind instead of saying “the conditional probability of a randomly 
chosen person's being colorblind on the hypothesis that the person 
is a female.” Often the phrase “on the hypothesis H” is replaced 
by “if it is known that H occurred.” In short, our formulas and symbols 
are unequivocal, but phrasings in words are often informal and 
must be properly interpreted. 
Sometimes for stylistic clarity probabilities in sample space are called 
absolute probabilities in contradistinction to conditional ones. Strictly 
speaking, the adjective “absolute” is redundant and will be omitted. 

Taking conditional probabilities of various events with respect to a 
particular hypothesis H amounts to choosing H as a new sample space; 
we have to multiply all probabilities by the constant factor 1/P{H} in 
order to reduce the total probability of the new sample space to unity. 
This formulation shows that all general theorems on probabilities are valid 
also for conditional probabilities with respect to any particular hypothesis 
H. As an example we mention the fundamental relation for the probability 
of the occurrence of either A or B or both. We have 


(1.4)  $P\{A \cup B \mid H\} = P\{A \mid H\} + P\{B \mid H\} - P\{AB \mid H\}$. 


Similarly, all theorems of chapter IV concerning probabilities of the 
realization of m among N events carry over to conditional probabilities, 
but we shall not need them. 

Formula (1.3) is often used in the form 


(1.5) P{AH} = P{A|H}-P{H}. 


This is the so-called theorem on compound probabilities. To generalize 
it to three events A, B, C we first take H = BC as hypothesis and then 
apply (1.5) once more; it follows that 


(1.6) P{ABC} = P{A|BC}-P{B|C}-P{C}. 


A further generalization to four or more events is straightforward. 

We conclude with a simple formula which is frequently useful. Let 
$H_1, \ldots, H_n$ be a set of mutually exclusive events of which one necessarily 
occurs (that is, the union of $H_1, \ldots, H_n$ is the entire sample 
space). Then any event A can occur only in conjunction with some 
$H_j$, or in symbols, 


(1.7)  $A = AH_1 \cup AH_2 \cup \cdots \cup AH_n$. 

Since the $AH_j$ are mutually exclusive, their probabilities add. Applying 
(1.5) to $H = H_j$ and adding, we get 

(1.8)  $P\{A\} = \sum P\{A \mid H_j\} \cdot P\{H_j\}$. 

This formula is useful because an evaluation of the conditional probabilities 
$P\{A \mid H_j\}$ is sometimes easier than a direct calculation of P{A}. 


Examples. (a) Sampling without replacement. From a population 
of the n elements 1, 2, ..., n an ordered sample is taken. Let i and j 
be two different elements. Assuming that i is the first element drawn 
(event H), what is the probability that the second element is j (event 
A)? Clearly P{AH} = 1/[n(n - 1)] and P{A|H} = 1/(n - 1). This 
expresses the fact that the second choice refers to a population of n - 1 
elements, each of which has the same probability of being chosen. In 
fact, the most natural definition of random sampling is: “Whatever the 
first r choices, at the (r+1)st step each of the remaining n - r elements 
has probability 1/(n - r) to be chosen.” This definition is equivalent 
to that given in chapter II, but we could not have stated it earlier 
since it involves the notion of conditional probability. 

(b) Four balls are placed successively into four cells, all $4^4$ arrangements 
being equally probable. Given that the first two balls are in 
different cells (event H), what is the probability that one cell contains 
exactly three balls (event A)? Given H, the event A can occur in two 
ways, and so $P\{A \mid H\} = 2 \cdot 4^{-2} = 1/8$. (It is easy to verify directly that 
the events H and AH contain $12 \cdot 4^2$ and $12 \cdot 2$ points, respectively.) 

(c) Distribution of sexes. Consider families with exactly two children. 
Letting b and g stand for boy and girl, respectively, and the 
first letter for the older child, we have four possibilities: bb, bg, gb, gg. 
These are the four sample points, and we associate probability 1/4 with 
each. Given that a family has a boy (event H), what is the probability 
that both children are boys (event A)? The event AH means bb, and 
H means bb, or bg, or gb. Therefore, P{A|H} = 1/3; in about one-third 
of the families with the characteristic H we can expect that A also will 
occur. It is interesting that most people expect the answer to be 1/2. 
This is the correct answer to a different question, namely: A boy is 
chosen at random and found to come from a family with two children; 
what is the probability that the other child is a boy? The difference 
may be explained empirically. With our original problem we might 
refer to a card file of families, with the second to a file of males. In 
the latter, each family with two boys will be represented twice, and 
this explains the difference between the two results. 

(d) Stratified populations. Suppose a human population consists of 
subpopulations or strata $H_1, H_2, \ldots$. These may be races, age groups, 
professions, etc. Let $p_j$ be the probability that an individual chosen 
at random belongs to $H_j$. Saying “the probability that an individual 
in $H_j$ is left-handed is $q_j$” is short for “the conditional probability of 
the event A (left-handedness) on the hypothesis that an individual belongs 
to $H_j$ is $q_j$.” The probability that an individual chosen at random 
is left-handed is $p_1 q_1 + p_2 q_2 + p_3 q_3 + \cdots$, which is a special case 
of (1.8). Given that an individual is left-handed, the conditional probability 
of his belonging to stratum $H_j$ is 


(1.9)  $P\{H_j \mid A\} = \dfrac{p_j q_j}{p_1 q_1 + p_2 q_2 + \cdots}$. 
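When all sample points are equally probable, the ratio in definition (1.3) can be found by direct counting. The following Python sketch, which is an illustration and not part of the original text, lists the sample spaces of examples (b) and (c) and counts the points of H and AH; the helper `cond_prob` and the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def cond_prob(space, A, H):
    """P{A|H} = N_AH / N_H for equally likely sample points, as in (1.3)."""
    in_H = [pt for pt in space if H(pt)]
    in_AH = [pt for pt in in_H if A(pt)]
    return Fraction(len(in_AH), len(in_H))

# Example (b): pt[i] is the cell receiving ball i; 4^4 equally likely points.
placements = list(product(range(4), repeat=4))
first_two_differ = lambda pt: pt[0] != pt[1]
some_cell_has_three = lambda pt: any(pt.count(c) == 3 for c in range(4))
p_b = cond_prob(placements, some_cell_has_three, first_two_differ)

# Example (c): families with two children, older child listed first.
families = list(product("bg", repeat=2))
p_c = cond_prob(families, lambda f: f == ("b", "b"), lambda f: "b" in f)

# The "different question" of example (c): sample from a file of boys,
# in which a family with two boys is represented twice.
boys = [(f, i) for f in families for i in range(2) if f[i] == "b"]
p_other_boy = Fraction(sum(1 for f, i in boys if f[1 - i] == "b"), len(boys))

print(p_b, p_c, p_other_boy)
```

The last computation makes the empirical explanation concrete: changing the file from families to boys changes the sample space, and with it the answer.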
2. PROBABILITIES DEFINED BY CONDITIONAL 
PROBABILITIES. URN MODELS 


In the preceding section we have taken the probabilities in the sample 
space for granted and merely calculated a few conditional probabilities. 
In applications, many experiments are described by specifying certain 
conditional probabilities (although the adjective ‘‘conditional’’ is usu- 
ally omitted). Theoretically this means that the probabilities in sample 
space are to be derived from the given conditional probabilities. It has 
already been pointed out [example (1.a)] that sampling without replace- 
ment is best defined by saying that whatever the result of the r first 
selections, at the (r+1)st step each of the remaining elements has the 
same probability of being selected. Similarly, in example (1.d) our 
stratified population is completely described by stating the absolute 
probabilities $p_j$ of the several strata, and the conditional probability 
$q_j$ of the characteristic “left-handed” within each stratum. A few more 
examples will reveal the general scheme more effectively than a direct 
description could. 


Examples. (a) In example I(5.b) we have considered three players 
a, b, c taking turns at a game; we have described the points of the 
sample space but have not assigned probabilities to them. Suppose 
now that the game is such that at each trial each of the two partners 
has probability 1/2 of winning. This statement does not contain the 
word “conditional probability” but refers to it nonetheless. For it says 
that if player a participates in the rth round (event H), his probability 
of winning that particular round is 1/2. It follows from equation (1.5) 
that the probability of a winning at the first and second try is 1/4, in 
symbols, P{aa} = 1/4. A repeated application of (1.5) shows that 
P{acc} = 1/8, P{acbb} = 1/16, etc.; that is, a sample point of the scheme 
(*) involving r letters has probability $2^{-r}$. This is the assignment of 
probabilities used in problem 1,5, but now the description is more 
intuitive. (Continued in problem 14.) 

(b) Families. We want to interpret the following statement. “The 
probability of a family with exactly k children is $p_k$ (where $p_0 + p_1 + 
p_2 + \cdots = 1$). For any family size all sex distributions have equal 
probabilities.” Letting b stand for boy and g for girl, our sample space 
consists of the points 0 (no children), b, g, bb, bg, gb, gg, bbb, .... The 
second assumption in quotation marks can be stated more formally 
thus: If it is known that the family has exactly n children, each of the 
$2^n$ possible sex distributions has conditional probability $2^{-n}$. The 
probability of the hypothesis is $p_n$, and we see from (1.5) that the 
absolute probability of any arrangement of n letters b and g is $p_n \cdot 2^{-n}$. 

Note that this is an example of a stratified population, the families 
of size j forming the stratum $H_j$. As an exercise let A stand for the 
event “the family has boys but no girls.” Its probability is obviously 
$P\{A\} = p_1 \cdot 2^{-1} + p_2 \cdot 2^{-2} + \cdots$, which is a special case of (1.8). The 
hypothesis $H_j$ in this case is “family has j children.” We now ask the 
question: If it is known that a family has no girls, what is the (conditional) 
probability that it has only one child? Here A is the hypothesis. 
Let H be the event “only one child.” Then AH means “one child and 
no girl,” and 


(2.1)  $P\{H \mid A\} = \dfrac{P\{AH\}}{P\{A\}} = \dfrac{p_1 2^{-1}}{p_1 2^{-1} + p_2 2^{-2} + p_3 2^{-3} + \cdots}$, 


which is a special case of (1.9). 
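Formula (2.1) can be illustrated by building the sample space explicitly. In the sketch below the family-size probabilities $p_0, \ldots, p_3$ are purely hypothetical values chosen for the illustration (the text leaves the $p_k$ unspecified); every sex distribution of a family with n children receives the absolute probability $p_n \cdot 2^{-n}$, as prescribed by (1.5).

```python
from fractions import Fraction
from itertools import product

# Hypothetical family-size distribution p_0, ..., p_3 (not from the text).
p = [Fraction(1, 5), Fraction(3, 10), Fraction(3, 10), Fraction(1, 5)]

# Absolute probability of each sample point: p_n * 2^{-n}.
# The empty string stands for the point "0" (no children).
space = {}
for n, p_n in enumerate(p):
    for kids in product("bg", repeat=n):
        space["".join(kids)] = p_n * Fraction(1, 2 ** n)

A = {s for s in space if s and "g" not in s}   # boys but no girls
H = {s for s in space if len(s) == 1}          # only one child

P_A = sum(space[s] for s in A)                 # special case of (1.8)
P_AH = sum(space[s] for s in A & H)
P_H_given_A = P_AH / P_A                       # formula (2.1)
print(P_A, P_H_given_A)
```

With these illustrative values the hypothesis A has probability 1/4, and a family known to have no girls has only one child with probability 3/5.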

(c) Urn models for aftereffect. For the sake of definiteness consider 
an industrial plant liable to accidents. The occurrence of an accident 
might be pictured as the result of a superhuman game of chance: Fate 
has in storage an urn containing red and black balls; at regular time 
intervals a ball is drawn at random, a red ball signifying an accident. 
If the chance of an accident remains constant in time, the composition 
of the urn is always the same. But it is conceivable that each accident 
has an aftereffect in that it either increases or decreases the chance of 
new accidents. This corresponds to an urn whose composition changes 
according to certain rules that depend on the outcome of the successive 
drawings. It is easy to invent a variety of such rules to cover various 
situations, but we shall be content with a discussion of the following.¹ 


Urn model: An urn contains b black and r red balls. A ball is drawn 
at random. It is replaced and, moreover, c balls of the color drawn and d 
balls of the opposite color are added. A new random drawing is made 
from the urn (now containing r + b + c + d balls), and this procedure 
is repeated. Here c and d are arbitrary integers. They may be chosen 
negative, except that in this case the procedure may terminate after 
finitely many drawings for lack of balls. In particular, choosing c = -1 
and d = 0 we have the model of random drawings without replacement 
which terminates after r + b steps. 


¹ The idea to use urn models to describe aftereffects (contagious diseases) seems 
to be due to Polya. His scheme (first introduced in F. Eggenberger and G. Polya, 
Über die Statistik verketteter Vorgänge, Zeitschrift für Angewandte Mathematik und 
Mechanik, vol. 3 (1923), pp. 279-289) served as a prototype for many models 
discussed in the literature. The model described in the text and its three special 
cases were proposed by B. Friedman, A simple urn model, Communications on Pure 
and Applied Mathematics, vol. 2 (1949), pp. 59-70. 
To turn our picturesque description into mathematics, note that it 
specifies conditional probabilities from which certain basic probabilities 
are to be calculated. A typical point of the sample space corresponding 
to n drawings may be represented by a sequence of n letters B and R. 
The event “black at first drawing” (i.e., the aggregate of all sequences 
starting with B) has probability b/(b + r). If the first ball is black, 
the (conditional) probability of a black ball at the second drawing is 
(b + c)/(b + r + c + d). The (absolute) probability of the sequence 
black, black (i.e., the aggregate of the sample points starting with BB) 
is therefore, by (1.5), 


(2.2)  $\dfrac{b}{b+r} \cdot \dfrac{b+c}{b+r+c+d}$. 


The probability of the sequence black, black, black is (2.2) multiplied 
by (b + 2c)/(b + r + 2c + 2d), etc. It is clear that in this way the 
probabilities of all sample points can be calculated. (Of course, in the 
case of a negative c or d the number n of drawings should be chosen 
small enough to avoid negative numbers of balls.) It is easily verified 
by induction that the probabilities of all sample points indeed add to 
unity. 
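The rule for computing the probability of a sample point can be written as a short routine. The sketch below is a hypothetical illustration in Python: it multiplies the successive conditional probabilities, updating the composition of the urn after each drawing, and the numerical urn compositions in the usage lines are arbitrary choices, not values from the text.

```python
from fractions import Fraction
from itertools import product

def sequence_prob(seq, b, r, c, d):
    """Probability of a color sequence such as 'BBR' under the urn model.

    After each drawing, c balls of the color drawn and d balls of the
    opposite color are added, as in the description of the model.
    """
    prob = Fraction(1)
    for color in seq:
        total = b + r
        if color == "B":
            prob *= Fraction(b, total)
            b, r = b + c, r + d
        else:
            prob *= Fraction(r, total)
            b, r = b + d, r + c
    return prob

# Arbitrary illustrative urn: b = 3, r = 2, c = 1, d = 2.
p_bb = sequence_prob("BB", 3, 2, 1, 2)               # formula (2.2)
total = sum(sequence_prob(s, 3, 2, 1, 2) for s in product("BR", repeat=4))
print(p_bb, total)
```

With c = -1 and d = 0 the same routine describes drawings without replacement; for instance, three black balls cannot be drawn from an urn holding only two, and the routine returns 0 for that sequence.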

Explicit expressions for the probabilities are not readily obtainable 
except in the most important and best-known special case, that of 

Polya's urn scheme which is characterized by d = 0, c > 0. Here 
after each drawing the number of balls of the color drawn increases, 
whereas the balls of opposite color remain unchanged in number. In 
effect the drawing of either color increases the probability of the same 
color at the next drawing, and we have a rough model of phenomena 
such as contagious diseases, where each occurrence increases the prob- 
ability of further occurrences. The analytical simplicity of the Polya 
model is due to the following obvious property: Any sequence of n 
drawings resulting in $n_1$ black and $n_2$ red balls ($n_1 + n_2 = n$) has the 
same probability as the event of extracting first $n_1$ black and then $n_2$ 
red balls, namely, 


(2.3)  $p_{n_1,n} = \dfrac{b(b+c)(b+2c)\cdots(b+n_1c-c)\cdot r(r+c)\cdots(r+n_2c-c)}{(b+r)(b+r+c)(b+r+2c)\cdots(b+r+nc-c)}$. 


On dividing numerator and denominator by $c^n$ and using the notation 
II(2.1), this formula may be rewritten in the following ways: 


(2.4)  $p_{n_1,n} = \dfrac{\left(\frac{b}{c}+n_1-1\right)_{n_1}\left(\frac{r}{c}+n_2-1\right)_{n_2}}{\left(\frac{b+r}{c}+n-1\right)_{n}} = \dfrac{\left(-\frac{b}{c}\right)_{n_1}\left(-\frac{r}{c}\right)_{n_2}}{\left(-\frac{b+r}{c}\right)_{n}}$. 

(The Polya scheme is discussed in problems 18-24.) 
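The property that the order of the colors does not matter in the Polya scheme can be checked directly. The following sketch is an illustration under arbitrary assumptions (an urn with b = 2, r = 3 and c = 2, and sequences with two black and three red drawings); it compares the product of conditional probabilities for every ordering with the right side of (2.3). The function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def polya_sequence_prob(seq, b, r, c):
    """Product of conditional probabilities for one ordering (d = 0)."""
    prob = Fraction(1)
    for color in seq:
        if color == "B":
            prob *= Fraction(b, b + r)
            b += c
        else:
            prob *= Fraction(r, b + r)
            r += c
    return prob

def polya_formula(n1, n2, b, r, c):
    """Right side of (2.3) for n1 black and n2 red drawings."""
    num = 1
    for i in range(n1):
        num *= b + i * c
    for i in range(n2):
        num *= r + i * c
    den = 1
    for i in range(n1 + n2):
        den *= b + r + i * c
    return Fraction(num, den)

b, r, c = 2, 3, 2                         # arbitrary illustrative urn
orderings = set(permutations("BBRRR"))    # all 10 distinct orderings
values = {polya_sequence_prob(seq, b, r, c) for seq in orderings}
print(values, polya_formula(2, 3, b, r, c))
```

The set `values` contains a single number, which shows that all ten orderings have the same probability.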

In addition to the Polya scheme our urn model contains another 
special case of interest, namely the 

Ehrenfest model² of heat exchange between two isolated bodies. In 
the original description, as used by physicists, the Ehrenfest model 
envisages two containers I and II and k particles distributed in them. 
A particle is chosen at random and moved from its container into the 
other container. This procedure is repeated. What is the distribution 
of the particles after n steps? To reduce this to an urn model it suffices 
to call the particles in container I red, the others black. Then at each 
drawing the ball drawn is replaced by a ball of the opposite color, 
that is, we have c = -1, d = 1. It is clear that in this case the process 
can continue as long as we please (if there are no red balls, a black 
ball is drawn automatically and replaced by a red one). [We shall discuss 
the Ehrenfest model in another way in example XV(2.f).] 

The special case c = 0, d > 0 has been proposed by Friedman as a 
model of a safety campaign. Every time an accident occurs (i.e., a red 
ball is drawn), the safety campaign is pushed harder; whenever no accident 
occurs, the campaign slackens and the probability of an accident 
increases. 

(d) Urn models for stratification. Spurious contagion. To continue 
in the vein of the preceding example, suppose that each person is liable 
to accidents and that their occurrence is determined by random draw- 
ings from an urn. This time, however, we shall suppose that no after- 
effect exists, so that the composition of the urn remains unchanged 
throughout the process. Now the chance of an accident or proneness 
to accidents may vary from person to person or from profession to pro- 
fession, and we imagine that each person (or each profession) has his 
own urn. In order not to complicate matters unnecessarily, let us sup- 
pose that there are just two types of people (two professions) and that 
their numbers in the total population stand in the ratio 1:5. We consider
then an urn I containing r_1 red and b_1 black balls, and an urn II
containing r_2 red and b_2 black balls. The experiment “choose a person
at random and observe how many accidents he has during n time units”
has the following counterpart: A die is thrown; if ace appears, choose
urn I, otherwise urn II. In each case n random drawings with replacement
are made from the urn. Our experiment describes the situation
of an insurance company accepting a new subscriber.

² P. and T. Ehrenfest, Über zwei bekannte Einwände gegen das Boltzmannsche
H-Theorem, Physikalische Zeitschrift, vol. 8 (1907), pp. 311-314. For a mathematical
discussion see M. Kac, Random walk and the theory of Brownian motion,
American Mathematical Monthly, vol. 54 (1947), pp. 369-391.

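By using (1.8) it is seen that the probability of red at the first drawing
is

(2.5)    $P\{R\} = \dfrac{1}{6}\cdot\dfrac{r_1}{b_1 + r_1} + \dfrac{5}{6}\cdot\dfrac{r_2}{b_2 + r_2}$

and the probability of a sequence red, red is

(2.6)    $P\{RR\} = \dfrac{1}{6}\left(\dfrac{r_1}{b_1 + r_1}\right)^2 + \dfrac{5}{6}\left(\dfrac{r_2}{b_2 + r_2}\right)^2.$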

No mathematical problem is involved in our model, but it has an 
interesting feature which has caused great confusion in applications. 
Suppose our insurance company observes that a new subscriber has an 
accident during the first year, and is interested in the probability of a 
further accident during the second year. In other words, given that 
the first drawing resulted in red, we ask for the (conditional) proba- 
bility of a sequence red, red. This is clearly the ratio P{RR}/P{R} 
and is different from P{R}. For the sake of illustration suppose that 
r_1/(b_1 + r_1) = 0.6 and r_2/(b_2 + r_2) = 0.06. The probability of red at
any drawing is 0.15, but if the first drawing resulted in red, the chances 
that the next drawing also results in red are 0.42. Note that our model 
involves no aftereffect in the total population, and yet the occurrence 
of an accident for a person chosen at random increases the odds that 
this same person will have a second accident. We have here an effect 
of sampling; the occurrence of an accident does not have a real effect, 
but it is an indication that the person chosen at random has a high 
proneness to accidents. 
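
The figures 0.15 and 0.42 follow directly from (2.5) and (2.6). A short check, given here only as an illustration (the helper name `stratified_urns` is hypothetical), may help the reader to experiment with other mixing ratios.

```python
from fractions import Fraction

def stratified_urns(weights, red_fractions):
    """Return P{R}, P{RR} and P{RR}/P{R} for a mixture of unchanging urns.

    weights[i] is the probability of choosing urn i, and
    red_fractions[i] is the proportion of red balls in urn i.
    """
    p_r = sum(w * f for w, f in zip(weights, red_fractions))        # as in (2.5)
    p_rr = sum(w * f * f for w, f in zip(weights, red_fractions))   # as in (2.6)
    return p_r, p_rr, p_rr / p_r

# The two types of people stand in the ratio 1:5.
weights = [Fraction(1, 6), Fraction(5, 6)]
red_fractions = [Fraction(6, 10), Fraction(6, 100)]
p_r, p_rr, p_second_given_first = stratified_urns(weights, red_fractions)
print(float(p_r), float(p_second_given_first))
```

Although each urn is sampled with replacement, the conditional probability exceeds P{R} because a first red ball makes urn I the more likely choice.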

In the statistical literature it has become customary to use the word 
contagion instead of aftereffect. The apparent aftereffect of sampling 
was at first misinterpreted as an effect of true contagion, and so statis- 
ticians now speak of contagion (or contagious probability distributions) 
in a vague and misleading manner. Take, for example, the ecologist 
searching for insects in a field. If after an unsuccessful period he finds 
an insect, he might conclude that the litter is likely close by and that his 
chances of finding another insect are good. Obviously no aftereffect is 
involved, and yet the statistician speaks of contagion.

(e) The following example is famous and illustrative but somewhat
artificial. Imagine a population of N + 1 urns, each containing N red
and white balls; the urn number k contains k red and N - k white
balls (k = 0, 1, 2, ..., N). An urn is chosen at random and n random
drawings are made from it, the ball drawn being replaced each time.
Suppose that all n balls turn out to be red (event A). We seek the (conditional)
probability that the next drawing will also yield a red ball
(event B). If the first choice falls on urn number k, then the probability
of extracting in succession n red balls is (k/N)^n. Hence, by (1.8),

(2.7)    $P\{A\} = \dfrac{1^n + 2^n + \cdots + N^n}{N^n (N+1)}.$

The event AB means that n + 1 drawings yield red balls, and therefore

(2.8)    $P\{AB\} = P\{B\} = \dfrac{1^{n+1} + 2^{n+1} + \cdots + N^{n+1}}{N^{n+1}(N+1)}.$

The required probability is P{B|A} = P{AB}/P{A}.

The sums in (2.7) and (2.8) can be considered Riemann sums approximating
integrals, so that when N is large

(2.9)    $\dfrac{1}{N}\sum_{k=1}^{N}\left(\dfrac{k}{N}\right)^n \approx \int_0^1 x^n\,dx = \dfrac{1}{n+1}.$

We have therefore for large N approximately

(2.10)    $P\{B|A\} \approx \dfrac{n+1}{n+2}.$


This formula can be interpreted roughly as follows: If all compositions 
of an urn are equally probable, and if n trials yielded red balls, the 
probability of a red ball at the next trial is (n + 1)/(n + 2). This is 
the so-called law of succession of Laplace (1812). 
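
The quality of the approximation (n + 1)/(n + 2) can be examined by evaluating the exact ratio of the two sums for a finite number of urns. The sketch below does this with exact fractions; the function name `succession_probability` is our own choice.

```python
from fractions import Fraction

def succession_probability(N, n):
    """Exact P{AB}/P{A} for N + 1 urns, urn k holding k red and N - k white balls."""
    p_a = Fraction(sum(k ** n for k in range(N + 1)), N ** n * (N + 1))
    p_ab = Fraction(sum(k ** (n + 1) for k in range(N + 1)), N ** (n + 1) * (N + 1))
    return p_ab / p_a

# Compare the exact value with the approximation (n + 1)/(n + 2) for n = 3.
for N in (10, 100, 1000):
    print(N, float(succession_probability(N, 3)), 4 / 5)
```

For small N the exact value is noticeably larger than the limit, which is a reminder that the formula rests on the passage to a large number of urns.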

Before the ascendance of the modern theory, the notion of equal 
probabilities was often used as synonymous for “no advance knowl- 
edge.” Laplace himself has illustrated the use of (2.10) by computing 
the probability that the sun will rise tomorrow, given that it has risen 
daily for 5000 years or n = 1,826,213 days. It is said that Laplace was 
ready to bet 1,826,214 to 1 in favor of regular habits of the sun, and 
we should be in a position to better the odds since regular service has 
followed for another century. A historical study would be necessary 
to render justice to Laplace and to understand his intentions. His 
successors, however, used similar arguments in routine work and recommended
methods of this kind to physicists and engineers in cases
where the formulas have no operational meaning. We should have to 
reject the method even if, for sake of argument, we were to concede 
that our universe was chosen at random from a collection in which all 
conceivable possibilities were equally likely. In fact, it pretends to 
judge the chances of the sun’s rising tomorrow from the assumed risings 
in the past. But the assumed rising of the sun on February 5, 3123 
B.C., is by no means more certain than that the sun will rise tomorrow. 
We believe in both for the same reasons. 


Note on Bayes’s Rule. In (1.9) and (2.2) we have calculated certain condi- 
tional probabilities directly from the definition. The beginner is advised always to 
do so and not to memorize the formula (2.12), which we shall now derive. It re- 
traces in a general way what we did in special cases, but it is only a way of rewriting 
(1.3). We had a collection of events H_1, H_2, ... which are mutually exclusive and
exhaustive, that is, every sample point belongs to one, and only one, among the
H_j. We were interested in

(2.11)    $P\{H_k|A\} = \dfrac{P\{AH_k\}}{P\{A\}}.$

If (1.5) and (1.8) are introduced into (2.11), it takes the form

(2.12)    $P\{H_k|A\} = \dfrac{P\{A|H_k\}P\{H_k\}}{\sum_j P\{A|H_j\}P\{H_j\}}.$

If the events H_k are called causes, then (2.12) becomes “Bayes’s rule for the probability
of causes.” Mathematically, (2.12) is a special way of writing (1.3) and
nothing more. The formula is useful in many statistical applications of the type
described in examples (b) and (d), and we have used it there. Unfortunately,
Bayes’s rule has been somewhat discredited by metaphysical applications of the 
type described in example (e). In routine practice this kind of argument can be 
dangerous. A quality control engineer is concerned with one particular machine 
and not with an infinite population of machines from which one was chosen at 
random. He has been advised to use Bayes’s rule on the grounds that it is logically 
acceptable and corresponds to our way of thinking. Plato used this type of argu- 
ment to prove the existence of Atlantis, and philosophers used it to prove the 
absurdity of Newton’s mechanics. But for our engineer the argument overlooks 
the circumstance that he desires success and that he will do better by estimating 
and minimizing the sources of various types of errors in prediction and guessing. 
The modern method of statistical tests and estimation is less intuitive but more 
realistic. It may be not only defended but also applied. 
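
As an illustration of (2.12) in the setting of example (d), the following sketch computes the conditional probabilities of the two urns given that the first drawing resulted in red. It merely retraces the definition; the function name `posterior` is an assumption of ours and not terminology used in the text.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Apply (2.12): priors[k] = P{H_k}, likelihoods[k] = P{A|H_k}."""
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P{A H_k}, by (1.5)
    total = sum(joint)                                     # P{A}, by (1.8)
    return [j / total for j in joint]

# Urn I is chosen with probability 1/6, urn II with probability 5/6;
# the proportions of red balls are 0.6 and 0.06, as in example (d).
urn_given_red = posterior(
    [Fraction(1, 6), Fraction(5, 6)],
    [Fraction(6, 10), Fraction(6, 100)],
)
print(urn_given_red)
```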


3. STOCHASTIC INDEPENDENCE 


In the examples above the conditional probability P{A|H} generally 
does not equal the absolute probability P{A}. Popularly speaking, 
the information whether H has occurred changes our way of betting 
on the event A. Only when P{A|H} = P{A} this information does
not permit any inference about the occurrence of A. In this case we
shall say that A is stochastically independent of H. Now (1.5) shows 
that the condition P{A|H} = P{A} can be written in the form 


(3.1)    P{AH} = P{A}·P{H}.


This equation is symmetric in A and H and shows that whenever A is 
stochastically independent of H, so is H of A. It is therefore preferable
to start from the following symmetric 


Definition 1. Two events A and H are said to be stochastically inde- 
pendent (or independent, for short) if equation (3.1) holds. This defini- 
tion is accepted also if P{H} = 0, in which case P{A|H} is not defined.
The term statistically independent is synonymous with stochastically 
independent. 


Examples. (a) A card is chosen at random from a deck of playing 
cards. For reasons of symmetry we expect the events “spade” and
“ace” to be independent. As a matter of fact, their probabilities are
1/4 and 1/13, and the probability of their simultaneous realization is 1/52.

(b) Two true dice are thrown. The events “ace with first die” and
“even face with second” are independent since the probability of their
simultaneous realization, 3/36 = 1/12, is the product of their probabilities,
namely 1/6 and 1/2.

(c) In a random permutation of the four letters (a, b, c, d) the events
“a precedes b” and “c precedes d” are independent. This is intuitively
clear and easily verified.

(d) Sex distribution. We return to example (1.6) but now consider
families with three children. We assume that each of the eight possibilities
bbb, bbg, ..., ggg has probability 1/8. Let H be the event “the
family has children of both sexes,” and A the event “there is at most
one girl.” Then P{H} = 6/8, and P{A} = 4/8. The simultaneous realization
of A and H means one of the possibilities bbg, bgb, gbb, and
therefore P{AH} = 3/8 = P{A}·P{H}. Thus in families with three
children the two events are independent. Note that this is not true 
for families with two or four children. This shows that it is not always 
obvious whether or not we have independence. 
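
The remark about families with two or four children is easily checked by enumeration. The sketch below assumes, as in the example, that all 2^n sex distributions are equally probable; the function name `both_sexes_vs_at_most_one_girl` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def both_sexes_vs_at_most_one_girl(n):
    """Return P{H}, P{A}, P{AH} for families with n children."""
    families = list(product("bg", repeat=n))
    total = len(families)
    H = {f for f in families if "b" in f and "g" in f}   # children of both sexes
    A = {f for f in families if f.count("g") <= 1}       # at most one girl
    return (Fraction(len(H), total),
            Fraction(len(A), total),
            Fraction(len(A & H), total))

for n in (2, 3, 4):
    p_h, p_a, p_ah = both_sexes_vs_at_most_one_girl(n)
    print(n, p_ah == p_a * p_h)
```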


If H occurs, the complementary event H’ does not occur, and vice 
versa. Stochastic independence implies that no inference can be drawn 
from the occurrence of H to that of A; therefore stochastic independ- 
ence of A and H should mean the same as independence of A and H’ 
(and, because of symmetry, also of A’ and H, and of A’ and H’). This 
assertion is easily verified, using the relation P{H’} = 1 - P{H}. If
(3.1) holds, then (since AH’ = A - AH)

(3.2)    P{AH’} = P{A} - P{AH} = P{A} - P{A}·P{H} = P{A}·P{H’},

as expected.
Suppose now that three events A, B, and C are pairwise independent 
so that 
         P{AB} = P{A}·P{B}
(3.3)    P{AC} = P{A}·P{C}
         P{BC} = P{B}·P{C}.


We might think that this always implies the independence of such 
pairs of events as AB and C. Unfortunately this is not necessarily so. 
We shall exhibit an example in which (3.3) is true but the simultaneous 
occurrence of A, B, and C is impossible, so that AB and C cannot be 
independent. 


Example. (e) Two dice are thrown and three events are defined as
follows: A means “odd face with first die”; B means “odd face with
second die”; finally, C means “odd sum” (one face even, the other odd).
If each of the 36 sample points has probability 1/36, then any two of
the events are clearly independent. The probability of each is 1/2, and
so is its conditional probability, assuming that one of the other two 
events has occurred. Nevertheless, the three events cannot occur si- 
multaneously. The information that A but not B has occurred assures 
that C has occurred, and a similar statement holds for all other com- 
binations. 
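
The claims of this example can be confirmed by listing the 36 sample points. The following sketch is a direct enumeration; the helper names `prob` and `both` are our own.

```python
from fractions import Fraction
from itertools import product

points = list(product(range(1, 7), repeat=2))   # the 36 sample points

def prob(event):
    """Probability of an event given as a predicate on sample points."""
    return Fraction(sum(1 for pt in points if event(pt)), len(points))

def A(pt):
    return pt[0] % 2 == 1              # odd face with first die

def B(pt):
    return pt[1] % 2 == 1              # odd face with second die

def C(pt):
    return (pt[0] + pt[1]) % 2 == 1    # odd sum

def both(*events):
    """Simultaneous realization of the given events."""
    return lambda pt: all(e(pt) for e in events)

print(prob(both(A, B)), prob(both(A, C)), prob(both(B, C)))   # pairwise products
print(prob(both(A, B, C)))                                    # impossible event
```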


It is desirable to reserve the term stochastic independence for the 
case where no such inference is possible. Then not only (3.3) must 
hold but in addition 


(3.4) P{ABC} = P{A}P{B}P{C}. 


This equation insures that A and BC are independent and also that 
the same is true of B and AC, and C and AB. Furthermore, it can 
now be proved also that A ∪ B and C are independent. In fact, by
the fundamental relation I(7.4) we have

(3.5)    P{(A ∪ B)C} = P{AC} + P{BC} - P{ABC}.

Now, applying (3.3) and (3.4) to the right side, we can factor out P{C}.
The other factor is P{A} + P{B} - P{AB} = P{A ∪ B} so that

(3.6)    P{(A ∪ B)C} = P{A ∪ B}·P{C}.

This makes it plausible that the conditions (3.3) and (3.4) together
suffice to avoid embarrassment; any event expressible in terms of A
and B will be independent of C.

In the general case of n events the following definition proves satisfactory.

Definition 2. The events A_1, A_2, ..., A_n are called mutually independent
if for all combinations 1 ≤ i < j < k < ... ≤ n the multiplication
rules

         P{A_i A_j} = P{A_i}P{A_j}
(3.7)    P{A_i A_j A_k} = P{A_i}P{A_j}P{A_k}
         . . . . . . . . . . . . . . . . . .
         P{A_1 A_2 ... A_n} = P{A_1}P{A_2} ... P{A_n}

apply.

The first line stands for $\binom{n}{2}$ equations, the second for $\binom{n}{3}$, etc. We
have, therefore,

$\binom{n}{2} + \binom{n}{3} + \cdots + \binom{n}{n} = (1 + 1)^n - \binom{n}{1} - \binom{n}{0} = 2^n - n - 1$

conditions which must be satisfied. On the other hand, the $\binom{n}{2}$ conditions
stated in the first line suffice to insure pairwise independence.
The whole system (3.7) looks like a complicated set of conditions, but 
it will soon become apparent that its validity is usually obvious and 
requires no checking. It is readily seen by induction [starting with 
n = 2 and (3.2)] that 


In definition 2 the system (3.7) may be replaced by the system of the 2^n
equations obtained from the last equation in (3.7) on replacing an arbitrary
number of events A_j by their complements A’_j.
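
For a finite sample space the conditions of definition 2 can be checked mechanically. The sketch below runs through all combinations of two or more events and counts them, so that the number 2^n - n - 1 can be observed; the function name `mutually_independent` and the coin-tossing data are illustrative assumptions only.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

def mutually_independent(probs, events):
    """Check the multiplication rules (3.7) and count the conditions examined.

    probs maps each sample point to its probability; events is a list of
    sets of sample points.
    """
    def p(event):
        return sum((probs[x] for x in event), Fraction(0))

    checked = 0
    ok = True
    for size in range(2, len(events) + 1):
        for subset in combinations(events, size):
            checked += 1
            joint = set(subset[0]).intersection(*subset[1:])
            if p(joint) != prod(p(e) for e in subset):
                ok = False
    return ok, checked

# Three tosses of a true coin; event j is "heads at the jth toss".
coin_points = list(product("HT", repeat=3))
coin_probs = {pt: Fraction(1, 8) for pt in coin_points}
heads_events = [{pt for pt in coin_points if pt[j] == "H"} for j in range(3)]
print(mutually_independent(coin_probs, heads_events))
```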


The distinction between mutual and pairwise independence is of theo- 
retical rather than practical interest. Practical examples of pairwise 
independent events that are not mutually independent apparently do 
not exist. The possibility of such an occurrence was discovered by S. 
Bernstein.


4. REPEATED TRIALS


The notion of stochastic independence finally enables us to formulate 
analytically the intuitive concept of experiments ‘‘repeated under iden- 
tical conditions.” 

Consider the sample space S representing a certain conceptual experiment.
Let the sample points be E_1, E_2, ... and denote their
probabilities by p_1, p_2, .... The possible results of a succession of two
similar experiments are the pairs (E_j, E_k), and they form a new sample
space. In it probabilities can be assigned in many ways. However, if
the experimentalist says that two measurements are performed under
identical conditions, he implies independence; the first outcome should
have no influence on the second. This means that the two events “first
outcome is E_j” and “second outcome is E_k” should be stochastically
independent or that

(4.1)    P{(E_j, E_k)} = p_j p_k.

This equation assigns a probability to every pair (E_j, E_k). Before we
can use (4.1) as a definition of probabilities in the new sample space,
we must show that the quantities p_j p_k add to unity. Now, in the sum
ΣΣ p_j p_k each term appears once, and only once, so that ΣΣ p_j p_k =
(p_1 + p_2 + ...)(p_1 + p_2 + ...) = 1. Hence (4.1) is acceptable as a definition
of probabilities.

Let A and B be two arbitrary events in the original sample space S.
We denote the event “A occurred at first trial and B at second” by
(A, B). Suppose A contains the points E_{α_1}, E_{α_2}, ... and B the points
E_{β_1}, E_{β_2}, .... Then (A, B) is the union of all pairs (E_{α_j}, E_{β_k}), and as
before we see that

(4.2)    $P\{(A, B)\} = \sum_j \sum_k p_{\alpha_j} p_{\beta_k} = \Bigl(\sum_j p_{\alpha_j}\Bigr)\Bigl(\sum_k p_{\beta_k}\Bigr) = P\{A\}P\{B\}.$

Hence the events A and B are independent. We see that the definition
(4.1) entails that all events at the second trial be independent of events
at the first trial. For the purposes of probability theory this describes
“identical experiments.”
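
A small numerical instance of (4.1) and (4.2) may be helpful. The sketch below builds the sample space of pairs for an arbitrary finite experiment with unequal probabilities; the names `two_trials` and `prob_pair_event` and the three-point space are our own illustrative choices.

```python
from fractions import Fraction
from itertools import product

def two_trials(p):
    """Assign probabilities to pairs of sample points by the rule (4.1)."""
    return {(ej, ek): p[ej] * p[ek] for ej, ek in product(p, repeat=2)}

def prob_pair_event(pair_probs, A, B):
    """P{(A, B)}: A occurs at the first trial and B at the second."""
    return sum((q for (ej, ek), q in pair_probs.items() if ej in A and ek in B),
               Fraction(0))

# A hypothetical experiment with three sample points of unequal probability.
p = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
pairs = two_trials(p)
A, B = {1, 2}, {3}
print(sum(pairs.values()))              # the quantities p_j p_k add to unity
print(prob_pair_event(pairs, A, B))     # equals P{A}P{B}, as in (4.2)
```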

These considerations obviously also apply to a succession of r experi- 
ments and lead to the 


Definition 1. Let S be a sample space with sample points E_1, E_2, ...
and corresponding probabilities p_1, p_2, .... By r independent trials corresponding
to S we mean the sample space whose points are the r-tuples
(E_{j_1}, E_{j_2}, ..., E_{j_r}) to which the probabilities

(4.3)    P{(E_{j_1}, E_{j_2}, ..., E_{j_r})} = p_{j_1} p_{j_2} ... p_{j_r}

are assigned.

In other words, each point of the new space is a sample of size r
(with possible repetitions) of points of the original space, and prob- 
abilities are defined by the multiplication rule (4.3). The reader is 
reminded that (4.3) is not the only possible definition of probabilities. 
In other words, repeated trials are not necessarily independent. For 
example, the Polya urn scheme [example (2.c)] defines dependent trials. 
Equation (4.3) defines independent trials or, in physical terms, trials 
repeated under identical conditions. 

The argument which led to (4.2) shows more generally the truth of 
the following theorem concerning independent trials. 


Theorem. Suppose that a system of events A_1, A_2, ..., A_r is such
that the jth trial alone decides whether or not A_j occurs; then the events
A_1, ..., A_r are mutually independent if the trials are independent, that
is, if (4.3) holds.


If S contains a finite number, N, of points, then there are N^r sample
points (E_{j_1}, ..., E_{j_r}). If each point of S has probability 1/N, then
(4.3) assigns probability N^{-r} to each point (E_{j_1}, ..., E_{j_r}). The new
approach is conceptually preferable to a formal assignment of equal 
probabilities because it applies to sample spaces with unequal prob- 
abilities and also to infinite sample spaces. It is indispensable for the 
general theory of probability where we consider even a single trial as 
the first in a potentially infinite sequence. We are then dealing only 
with infinite sequences (E_{j_1}, E_{j_2}, ...) of possible outcomes, and in this
new space probabilities are defined in a way consistent with (4.3). 
Unfortunately this leads beyond the theory of discrete sample spaces, 
to which the present volume is restricted. We have a more elementary 
theory but pay for it by the necessity of changing the sample space 
according to the number of trials. 

In the preceding discussion we have considered only repetitions of 
the same experiment, but successions of unlike experiments can be 
treated in the same way. If we first toss a coin, then throw a die, we 
naturally assume that the two experiments are independent. This 
amounts to assigning probabilities by the product rule. Thus 
P{(heads, ace)} = 1/2 · 1/6, etc. In this particular case this is equivalent
to assigning equal probabilities to all twelve sample points, but in 
general we must proceed as in (4.3). 


Definition 2. Let S’ and S” be two sample spaces and denote their
points by E’_1, E’_2, ... and E”_1, E”_2, .... Let the corresponding probabilities
be p’_1, p’_2, ... and p”_1, p”_2, .... The succession of the two
experiments is described by the space with points (E’_j, E”_k). Saying that
the two successive experiments are independent means defining probabilities
by

(4.4)    P{(E’_j, E”_k)} = p’_j p”_k.


[The notions which were just introduced are by no means peculiar to probability 
theory. Given two spaces S’ and S” with generic points E’ and E”, the set of all
pairs (E’, E”) is called the combinatorial product of S’ and S” and is usually denoted
by S’ × S”. For example, the Cartesian plane, that is the set of pairs (x, y), is
the combinatorial product of the x-axis and the y-axis. (The three dimensional
space may be viewed either as triple product or as product of the x, y-plane and
the z-axis.) The equation (4.4) defines what is usually called the product measure
of the probabilities in S’ and S”. We used the word experiment as equivalent to
a sample space with a probability defined in it. Similarly, succession of two inde- 
pendent experiments is short for combinatorial product of the corresponding sample 
spaces with probabilities defined by (4.4). 

These notions carry over in an obvious way to products of any number of spaces. 
For example, in (4.3) there figures the r-tuple combinatorial product of S with
itself. Where the student of probability speaks of the first, second, ..., trial, other 
mathematicians use the term: first, second, ..., coordinate space. (An event which 
depends only on the outcome of the first trial is also called a cylindrical set over 
the first coordinate space.)] 


The aggregate of all pairs (i, j) where i, j are positive integers between 1 and n
forms the product of the set of integers 1, 2, ..., n with itself. In sampling without
replacement pairs of the form (i, i) are forbidden, and therefore taking a sample of
size two without replacement does not directly lead to a product space. Neverthe- 
less, as the following examples will show, it is possible to represent it in a different 
way as a succession of independent experiments, and the same method applies to 
more complicated cases. 


Examples. (a) Permutations. We have considered the n! permutations
of a_1, a_2, ..., a_n as points of a sample space and attributed probability
1/n! to each. We may consider the same sample space as representing
n - 1 successive experiments as follows. Begin by writing
down a_1. The first experiment consists in putting a_2 either before or
after a_1. This done, we have three places for a_3 and the second experiment
consists of a choice among them, deciding on the relative order
of a_1, a_2, and a_3. In general, when a_1, ..., a_k are put into some relative
order, we proceed with experiment number k, which consists in
selecting one of the k + 1 places for a_{k+1}. In other words, we have a
succession of n - 1 experiments of which the kth can result in k + 1 different
choices (sample points), each having probability 1/(k + 1). The experiments
are independent, that is, the probabilities are multiplied.
Each permutation of the n elements has probability 1/2 · 1/3 ··· 1/n, in
accordance with the original definition.
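
The construction can be traced explicitly for small n. The sketch below runs through every sequence of insertion choices, builds the corresponding permutation, and multiplies the probabilities of the individual experiments; the function name `insertion_permutations` is ours, and the elements a_1, ..., a_n are represented simply by the integers 1, ..., n.

```python
from fractions import Fraction
from itertools import product

def insertion_permutations(n):
    """Map each permutation of 1, ..., n to its probability under successive insertion."""
    results = {}
    # experiment number k (k = 1, ..., n - 1) selects one of k + 1 places
    choice_ranges = [range(k + 1) for k in range(1, n)]
    for choices in product(*choice_ranges):
        order = [1]                        # begin by writing down a_1
        prob = Fraction(1)
        for k, place in enumerate(choices, start=1):
            order.insert(place, k + 1)     # put a_{k+1} into one of the k + 1 places
            prob *= Fraction(1, k + 1)     # independent experiments: multiply
        results[tuple(order)] = prob
    return results

perms = insertion_permutations(4)
print(len(perms), set(perms.values()))
```

Since distinct sequences of choices produce distinct permutations, the dictionary has exactly n! entries, each with probability 1/n!.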

(b) Sampling without replacement. Let the population be (a_1, ...,
a_n). In sampling without replacement each choice removes an element.
After k steps there remain n - k elements, and the next choice can be
described by specifying the number ν of the place of the element chosen
(ν = 1, 2, ..., n - k). In this way the taking of a sample of size r
without replacement becomes a succession of r experiments where the
first has n possible results, the second n - 1, the third n - 2, etc.
We attribute equal probabilities to all results of the individual experiments
and postulate that the r experiments are independent. This
amounts to attributing probability 1/(n)_r to each sample in accordance
with our definition of random samples. (Note that for n = 100, r = 3,
the sample (a_13, a_40, a_81) means choices number 13, 39, 79, respectively.
We must say that at the third experiment the seventy-ninth element
of the reduced population of n - 2 was chosen, for with the original
numbering the outcomes of the third experiment would depend on the
first two choices.) We see that the notion of repeated independent
experiments permits us to study sampling as a succession of individual
operations.
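
The renumbering used in the parenthetical note can be made explicit. The sketch below translates a sample given by the original indices into the place numbers in the successively reduced populations and computes the probability 1/(n)_r; both function names are hypothetical.

```python
from fractions import Fraction

def choice_numbers(sample):
    """Translate distinct original indices (1-based) into places in the reduced population."""
    choices = []
    removed = []
    for index in sample:
        # each element removed earlier with a smaller index shifts the place down by one
        shift = sum(1 for earlier in removed if earlier < index)
        choices.append(index - shift)
        removed.append(index)
    return choices

def sample_probability(n, r):
    """Probability 1/(n)_r of one ordered sample of size r without replacement."""
    prob = Fraction(1)
    for k in range(r):
        prob *= Fraction(1, n - k)   # the (k + 1)st experiment has n - k equally likely results
    return prob

print(choice_numbers([13, 40, 81]))
print(sample_probability(100, 3))
```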


*5. APPLICATIONS TO GENETICS 


The theory of heredity, originated by G. Mendel (1822-1884), pro- 
vides instructive illustrations for the applicability of simple probability 
models. We shall restrict ourselves to indications concerning the most 
elementary problems. In describing the biological background, we shall 
necessarily oversimplify and concentrate on such facts as are pertinent 
to the mathematical treatment. 

Heritable characters depend on special carriers, called genes. All 
cells of the body, except the reproductive cells or gametes, carry exact 
replicas of the same gene structure. The salient fact is that genes ap- 
pear in pairs. The reader may picture them as a vast collection of beads 
on short pieces of string, the chromosomes. These also appear in pairs, 
and paired genes occupy the same position on paired chromosomes. In 
the simplest case each gene of a particular pair can assume two forms 
(alleles), A and a. Then three different pairs can be formed, and, with
respect to this particular pair, the organism belongs to one of the three 
genotypes AA, Aa, aa (there is no distinction between Aa and aA). For 
example, peas carry a pair of genes such that A causes red blossom 
color and a causes white. The three genotypes are in this case distin- 
guishable as red, pink, and white. Each pair of genes determines one 
heritable factor, but the majority of observable properties of organisms 
depend on several factors. For some characteristics (e.g., eye color and
left-handedness) the influence of one particular pair of genes is predominant,
and in such cases the effects of Mendelian laws are readily
observable. Other characteristics, such as height, can be understood
as the cumulative effect of a very large number of genes [cf. example
X(5.c)]. Here we shall study genotypes and inheritance for only one
particular pair of genes with respect to which we have the three genotypes
AA, Aa, aa. Frequently there are N different forms A_1, ..., A_N
for the two genes and, accordingly, $\binom{N+1}{2}$ genotypes A_1A_1, A_1A_2,
..., A_NA_N. The theory applies to this case with obvious modifications
(cf. problem 27). The following calculations apply also to the
case where A is dominant and a recessive. By this is meant that Aa-individuals
have the same observable properties as AA, so that only
the pure aa-type shows an observable influence of the a-gene. All
shades of partial dominance appear in nature. Typical partially recessive
properties are blue eyes, left-handedness, etc.

* This section treats a special subject and may be omitted.

The reproductive cells, or gametes, are formed by a splitting process 
and receive one gene only. Organisms of the pure AA- and aa-geno- 
types (or homozygotes) produce therefore gametes of only one kind, 
but Aa-organisms (hybrids or heterozygotes) produce A- and a-gametes 
in equal numbers. New organisms are derived from two parental gam- 
etes from which they receive their genes. Therefore each pair includes 
a paternal and a maternal gene, and any gene can be traced back to 
one particular ancestor in any generation, however remote. 

The genotypes of offspring depend on a chance process. At every
occasion, each parental gene has probability 1/2 to be transmitted, and
the successive trials are independent. In other words, we conceive of
the genotypes of n offspring as the result of n independent trials, each
of which corresponds to the tossing of two coins. For example, the
genotypes of descendants of an Aa × Aa pairing are AA, Aa, aa with
respective probabilities 1/4, 1/2, 1/4. An AA × aa union can have only
Aa-offspring, etc.
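
The "tossing of two coins" can be written out as an enumeration of the four equally probable combinations of one paternal and one maternal gene. The following is a sketch only; the function name `offspring_distribution` and the representation of genotypes as two-letter strings are our own conventions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def offspring_distribution(father, mother):
    """Genotype probabilities of an offspring; each parent is a string such as 'Aa'."""
    counts = Counter()
    for paternal, maternal in product(father, mother):
        # Aa and aA are not distinguished; sorting puts 'A' before 'a'
        genotype = "".join(sorted((paternal, maternal)))
        counts[genotype] += Fraction(1, 4)   # each gene is transmitted with probability 1/2
    return dict(counts)

print(offspring_distribution("Aa", "Aa"))
print(offspring_distribution("AA", "aa"))
```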

Looking at the population as a whole, we conceive of the pairing of 
parents as the result of a second chance process. We shall investigate 
only the so-called random mating, which is defined by this condition: 
If r descendants in the first filial generation are chosen at random, then 
their parents form a random sample of size r, with possible repetitions, 
from the aggregate of all possible parental pairs. In other words, each 
descendant is to be regarded as the product of a random selection of 
parents, and all selections are mutually independent. Random mating 
is an idealized model of the conditions prevailing in many natural popu- 
lations and in field experiments. However, if red peas are sown in one 
corner of the field and white peas in another, parents of like color will
unite more often than under random mating. Preferential selectivity
(such as blonde preferring blondes) also violates the condition of random
mating. Extreme non-random mating is represented by self-fertilizing
plants and artificial inbreeding. Some such assortative mating
systems will be analyzed mathematically, but for the most part we shall 
restrict our attention to random mating. 

The genotype of an offspring is the result of four independent random 
choices. The genotypes of the two parents can be selected in 3·3 ways,
their genes in 2·2 ways. However, we may combine two selections and
describe the process as one of double selection thus: The paternal and 
maternal gene are each selected independently and at random from 
the population of all genes carried by males or females of the parental 
population. 

Suppose that the three genotypes AA, Aa, aa occur among males
and females in the same ratios, u:2v:w. We shall suppose u + 2v + w = 1
and call u, 2v, w the genotype frequencies. Put

(5.1)    p = u + v,    q = v + w.


Clearly the numbers of A- and a-genes are as p:q, and since p + q = 1
we shall call p and q the gene frequencies of A and a. In each of the
two selections an A-gene is selected with probability p, and, because of
the assumed independence, the probability of an offspring being AA
is p^2. The genotype Aa can occur in two ways, and its probability is
therefore 2pq. Thus, under random mating conditions an offspring
belongs to the genotypes AA, Aa, or aa with probabilities

(5.2)    u_1 = p^2,    2v_1 = 2pq,    w_1 = q^2.


Examples. (a) All parents are Aa (heterozygotes); then u = w = 0,
2v = 1, and p = q = 1/2. (b) AA- and aa-parents are mixed in equal
proportions; then u = w = 1/2, v = 0, and again p = q = 1/2. (c) Finally,
u = w = 1/4, 2v = 1/2; again p = q = 1/2. In all three cases we
have for the filial generation u_1 = 1/4, 2v_1 = 1/2, w_1 = 1/4.
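
The three examples can be reproduced by a direct transcription of (5.1) and (5.2). In the sketch below the function name `random_mating` is our own, and the genotype frequencies are passed in the order u, 2v, w.

```python
from fractions import Fraction

def random_mating(u, two_v, w):
    """Genotype probabilities (u_1, 2v_1, w_1) of an offspring, by (5.1) and (5.2)."""
    v = two_v / 2
    p, q = u + v, v + w            # gene frequencies, (5.1)
    return p * p, 2 * p * q, q * q

half, quarter = Fraction(1, 2), Fraction(1, 4)
examples = {
    "a": (Fraction(0), Fraction(1), Fraction(0)),   # all parents are Aa
    "b": (half, Fraction(0), half),                 # AA and aa in equal proportions
    "c": (quarter, half, quarter),
}
for name, frequencies in examples.items():
    print(name, random_mating(*frequencies))
```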


For a better understanding of the implications of (5.2) let us fix the 
gene frequencies p and q (p + q = 1) and consider all systems of genotype
frequencies u, 2v, w for which u + v = p and v + w = q. They
all lead to the same probabilities (5.2) for the first filial generation.
Among them there is the particular distribution

(5.3)    u = p^2,    2v = 2pq,    w = q^2.


If the frequencies u, v, w in the original generation stand in the particular
relation (5.3), as in example (c), then we find for the genotype
probabilities in the first filial generation u_1 = u, v_1 = v, and w_1 = w.
Therefore we call genotype distributions of the form (5.3) stationary. 
To every ratio p:q there corresponds a stationary distribution, or equi- 
librium. 

Equations (5.2) give the genotype probabilities for a randomly se- 
lected individual of the second generation. In a large population we 
must expect the actual genotype frequencies to be close to the theoretical
distribution.³ Now, whatever the distribution u:2v:w in the
parental generation, equations (5.2) define a stationary distribution;
in it the genes A and a appear with frequencies [cf. (5.1)] u_1 + v_1 = u + v = p
and v_1 + w_1 = v + w = q. In other words, if the observed
frequencies coincided exactly with the calculated probabilities, then 
the first filial generation would have a stationary genotype distribution 
which would perpetuate itself without change in all succeeding genera- 
tions. In practice, deviations will be observed, but for large popula- 
tions we can say: Whatever the composition of the parent population may 
be, random mating will within one generation produce an approximately 
stationary genotype distribution with unchanged gene frequencies. From 
the second generation on, there is no tendency toward a systematic 
change; a steady state is reached with the first filial generation. This 
was first noticed by G. H. Hardy,⁴ who thus resolved assumed difficulties
in Mendelian laws. It follows in particular that under conditions
of random mating the frequencies of the three genotypes must
stand in the ratios p^2:2pq:q^2. This can in turn be used to check the
assumption of random mating. 
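
The one-generation equilibrium can be checked numerically. The following is a minimal sketch, assuming that (5.2) assigns the offspring genotypes the probabilities $p^2$, $2pq$, $q^2$ with $p = u + v$ and $q = v + w$; the function name `random_mating` and the parental frequencies are illustrative choices, not part of the text.

```python
def random_mating(u, two_v, w):
    """One generation of random mating.

    Takes the genotype frequencies (u, 2v, w) of AA, Aa, aa and returns
    the genotype probabilities of the next generation.
    """
    v = two_v / 2.0
    p = u + v          # frequency of gene A
    q = v + w          # frequency of gene a
    return p * p, 2.0 * p * q, q * q


# Hypothetical parental distribution, chosen only for illustration.
parents = (0.5, 0.2, 0.3)            # u, 2v, w (they sum to 1)
first = random_mating(*parents)      # first filial generation
second = random_mating(*first)       # second filial generation

print(first)
print(second)
```

Under these assumptions the two printed triples coincide up to rounding: the first filial generation is already stationary, although the parental distribution was not.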

Hardy also pointed out that emphasis must be put on the word 
“approximately.” Even with a stationary distribution we must expect 
small changes from generation to generation, which leads us to the fol- 
lowing picture. Starting from any parent population, random mating 
tends to establish the stationary distribution (5.3) within one
generation. For a stationary distribution there is no tendency toward a
systematic change of any kind. However, chance fluctuations will change
the gene frequencies p and q from generation to generation, and the
genetic composition will slowly drift. There are no restoring forces
seeking to re-establish original frequencies. On the contrary, our
simplified model leads to the conclusion [cf. example XV(2.k)] that, for a
population bounded in size, one gene should ultimately die out, so that
the population would eventually belong to one of the pure types, AA
or aa. In nature this does not necessarily occur because of the
creation of new genes by mutations, selections, and many other effects,
which can be studied by more refined mathematical tools (Markov
chains, diffusion theory).


3 Without this our probability model would be void of operational meaning. The
statement is made precise by the law of large numbers and the central limit
theorem, which permits us to estimate the effect of chance fluctuations.

4 G. H. Hardy, Mendelian proportions in a mixed population, Letter to the
Editor, Science, N.S., vol. 28 (1908), pp. 49-50. Anticipating the language of
chapters IX and XV, we can describe the situation as follows. The frequencies of
the three genotypes in the nth generation are three random variables whose
expected values are given by (5.2) and do not depend on n. Their actual values will
vary from generation to generation and form a stochastic process of the Markov
type.

Hardy’s theorem is frequently interpreted to imply a strict stability 
for all times. It is a common fallacy to believe that the law of large 
numbers acts as a force endowed with memory seeking a return to the 
original state, and many wrong conclusions have been drawn from this 
assumption. (The biological processes here considered are typical of 
the important class of Markov processes which will be studied in detail 
in chapter XV.) Note that Hardy’s law does not apply to the distri- 
bution of two pairs of genes (e.g., eye color and left-handedness) with 
the nine genotypes AABB, AABb, ..., aabb. There is still a tendency 
toward a stationary distribution, but equilibrium is not reached in the 
first generation (cf. problem 31). 


*6. SEX-LINKED CHARACTERS 


In the introduction to the preceding section it was mentioned that 
genes lie on chromosomes. These appear in pairs and are transmitted 
as units, so that all genes on a chromosome stick together.5 Our scheme
for the inheritance of genes therefore applies also to chromosomes as 
units. Sex is determined by two chromosomes; females are XX, males 
XY. The mother necessarily transmits an X-chromosome, and the sex 
of offspring depends on the chromosome transmitted by the father. 
Accordingly, male and female gametes are produced in equal numbers. 
The difference in birth rate for boys and girls is explained by variations 
in prenatal survival chances. 

It has been said that both genes and chromosomes appear in pairs. 
There is an exception inasmuch as the genes situated on the X-chromo- 
some have no corresponding gene on Y. Females have two X-chromo- 
somes, and hence two of such X-linked genes; however, in males the 
X-genes appear as singles. Typical are two sex-linked genes causing
colorblindness and haemophilia. With respect to each of them, females
can still be classified into the three genotypes, AA, Aa, aa, but, having
only one gene, males have only the two genotypes A and a. Note that
a son always has the father's Y-chromosome so that a sex-linked
character cannot be inherited from father to son. However, it can pass
from father to daughter and from her to a grandson.


* This section treats a special topic and may be omitted.
5 This picture is somewhat complicated by occasional breakings and
recombinations of chromosomes [cf. problem II(10.12)].

We now proceed to generalize the analysis of the preceding section. 
Assume again random mating and let the frequencies of the genotypes 
AA, Aa, aa in the female population be u, 2v, w, respectively. As 
before put $p = u + v$, $q = v + w$. The frequencies of the two male
genotypes A and a will be denoted by $p'$ and $q'$ ($p' + q' = 1$). Then
p and $p'$ are the frequencies of the A-gene in the female and male
populations, respectively. The probability for a female descendant to
be of genotype AA, Aa, aa will be denoted by $u_1$, $2v_1$, $w_1$; the
analogous probabilities for the male types A and a are $p'_1$, $q'_1$. Now a
male offspring receives his X-chromosome from the female parent, and hence


(6.1) $p'_1 = p, \quad q'_1 = q.$

For the three female genotypes we find, as in section 5,

(6.2) $u_1 = pp', \quad 2v_1 = pq' + qp', \quad w_1 = qq'.$

Hence


(6.3) $p_1 = u_1 + v_1 = \tfrac{1}{2}(p + p'), \quad q_1 = v_1 + w_1 = \tfrac{1}{2}(q + q').$


We can interpret these formulas as follows. Among the male
descendants the genes A and a appear approximately with the frequencies
p, q of the maternal population; the gene frequencies among female
descendants are approximately $p_1$ and $q_1$, or halfway between those
of the paternal and maternal populations. We discern a tendency
toward equalization of the gene frequencies. In fact, from (6.1) and
(6.3) we get


(6.4) $p'_1 - p_1 = \tfrac{1}{2}(p - p'), \quad q'_1 - q_1 = \tfrac{1}{2}(q - q').$


This means that random mating will in one generation reduce approxi- 
mately by one-half the differences between gene frequencies among 
males and females. However, it will not eliminate the differences, and 
a tendency toward further reduction will subsist. In contrast to 
Hardy’s law, we have here no stationary situation after one generation. 
We can pursue the systematic component of the changes from generation
to generation by neglecting chance fluctuations and identifying
the theoretical probabilities (6.2) and (6.3) with corresponding actual
frequencies in the first filial generation. For the second generation we
obtain by the same process


(6.5) $p_2 = \tfrac{1}{2}(p_1 + p'_1) = \tfrac{3}{4}p + \tfrac{1}{4}p', \quad q_2 = \tfrac{1}{2}(q_1 + q'_1) = \tfrac{3}{4}q + \tfrac{1}{4}q'$

and, of course, $p'_2 = p_1$, $q'_2 = q_1$. A few more trials will lead to the
general expression for the probabilities $p_n$ and $q_n$ among females of the
nth descendant generation. Put


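(6.6) $\alpha = \tfrac{1}{3}(2p + p'), \quad \beta = \tfrac{1}{3}(2q + q').$

(Note that $\alpha + \beta = 1$.) Then


(6.7) $p_n = \dfrac{p_{n-1} + p'_{n-1}}{2} = \alpha + (-1)^n \dfrac{p - p'}{3 \cdot 2^n},$
      $q_n = \dfrac{q_{n-1} + q'_{n-1}}{2} = \beta + (-1)^n \dfrac{q - q'}{3 \cdot 2^n},$


and $p'_n = p_{n-1}$, $q'_n = q_{n-1}$. Hence

(6.8) $p_n \to \alpha, \quad p'_n \to \alpha, \quad q_n \to \beta, \quad q'_n \to \beta.$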


The genotype frequencies in the female population, as given by (6.2), 
are 


(6.9) $u_n = p_{n-1}p'_{n-1}, \quad 2v_n = p_{n-1}q'_{n-1} + q_{n-1}p'_{n-1}, \quad w_n = q_{n-1}q'_{n-1}.$

Hence


(6.10) $u_n \to \alpha^2, \quad 2v_n \to 2\alpha\beta, \quad w_n \to \beta^2.$


These formulas show that there is a strong systematic tendency,
from generation to generation, toward a state where the genotypes A
and a appear among males with frequencies $\alpha$ and $\beta$, and the female
genotypes AA, Aa, aa have probabilities $\alpha^2$, $2\alpha\beta$, $\beta^2$, respectively. The
convergence is very fast, as indicated by (6.7). In practice, equilib- 
rium will be reached after three or four generations. To be sure, small 
chance fluctuations will be superimposed on the described changes, but 
the latter represent the prevailing systematic tendency. 
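
The speed of convergence can be made concrete with a short computation. The sketch below iterates the recursion $p_n = \tfrac{1}{2}(p_{n-1} + p'_{n-1})$, $p'_n = p_{n-1}$ and compares it with the closed form (6.7); the starting frequencies 0.9 and 0.3 and the name `next_generation` are hypothetical, chosen only for illustration.

```python
def next_generation(p_f, p_m):
    """Return the A-gene frequencies (female, male) of the next generation."""
    return 0.5 * (p_f + p_m), p_f    # female: average; male: maternal value


p, p_prime = 0.9, 0.3                # hypothetical initial frequencies
alpha = (2 * p + p_prime) / 3        # limiting frequency, as in (6.6)

p_f, p_m = p, p_prime
history = []
for n in range(1, 7):
    p_f, p_m = next_generation(p_f, p_m)
    closed = alpha + (-1) ** n * (p - p_prime) / (3 * 2 ** n)   # (6.7)
    history.append((n, p_f, p_m, closed))
    print(n, round(p_f, 6), round(p_m, 6), round(closed, 6))
```

With these starting values the female frequency oscillates around $\alpha = 0.7$ and the distance to the limit is halved in each generation, so after a few generations it is already small.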

Our main conclusion is that under random mating we can expect the
sex-linked genotypes A and a among males, and AA, Aa, aa among
females to occur approximately with the frequencies $\alpha$, $\beta$, $\alpha^2$, $2\alpha\beta$, $\beta^2$,
respectively, where $\alpha + \beta = 1$.


6 In the terminology introduced in footnote 4 we can interpret $p_n$ and $q_n$ as the
expected values of the gene frequencies in the nth female generation. With this
interpretation the formulas for $p_n$ and $q_n$ are no longer approximations but exact.


Application. Many sex-linked genes, like colorblindness, are reces- 
sive and cause defects. Let a be such a gene. Then all a-males and 
all aa-females show the defect. Females of Aa-type may transmit the 
defect to their offspring but are not themselves affected. Hence we 
expect that a recessive sex-linked defect which occurs among males with
frequency $\alpha$ occurs among females with frequency $\alpha^2$. If one man in 100
is colorblind, one woman in 10,000 should be affected.


*7. SELECTION 


As a typical example of the influence of selection we shall investigate 
the case where aa-individuals cannot multiply. This happens when 
the a-gene is recessive and lethal, so that aa-individuals are born but 
cannot survive. Another case occurs when artificial interference by 
breeding or laws prohibits mating of aa-individuals. 

Assume random mating among AA- and Aa-individuals but no mat- 
ing of aa-types. Let the frequencies with which the genotypes AA, 
Aa, aa appear in the total population be u, 2v, w. The corresponding 
frequencies for parents are then 

(7.1) $u^* = \dfrac{u}{1 - w}, \quad 2v^* = \dfrac{2v}{1 - w}, \quad w^* = 0.$


We can proceed as in section 5, but we must use the quantities (7.1) 
instead of u, 2v, w. Hence, (5.1) is to be replaced by 


(7.2) $p = \dfrac{u + v}{1 - w}, \quad q = \dfrac{v}{1 - w}.$


The probabilities of the three genotypes in the first filial generation
are again given by (5.2) or $u_1 = p^2$, $2v_1 = 2pq$, $w_1 = q^2$.

As before, in order to investigate the systematic changes from
generation to generation, we have to replace u, v, w by $u_1$, $v_1$, $w_1$ and thus
obtain probabilities $u_2$, $v_2$, $w_2$ for the second descendant generation,
etc. In general we get from (7.2)


(7.3) $p_n = \dfrac{u_n + v_n}{1 - w_n}, \quad q_n = \dfrac{v_n}{1 - w_n}$

and

(7.4) $u_{n+1} = p_n^2, \quad 2v_{n+1} = 2p_nq_n, \quad w_{n+1} = q_n^2.$


* This section treats a special subject and may be omitted.

A comparison of (7.3) and (7.4) shows that


(7.5) $p_{n+1} = \dfrac{u_{n+1} + v_{n+1}}{1 - w_{n+1}} = \dfrac{p_n}{1 - q_n^2} = \dfrac{1}{1 + q_n}$

and similarly

(7.6) $q_{n+1} = \dfrac{v_{n+1}}{1 - w_{n+1}} = \dfrac{q_n}{1 + q_n}.$


From (7.6) we can calculate $q_n$ explicitly. In fact


(7.7) $\dfrac{1}{q_{n+1}} = 1 + \dfrac{1}{q_n}$

whence successively

(7.8) $\dfrac{1}{q_1} = 1 + \dfrac{1}{q}, \quad \dfrac{1}{q_2} = 2 + \dfrac{1}{q}, \quad \dfrac{1}{q_3} = 3 + \dfrac{1}{q}, \quad \ldots, \quad \dfrac{1}{q_n} = n + \dfrac{1}{q}$

or

(7.9) $q_n = \dfrac{q}{1 + nq}, \quad w_{n+1} = \left(\dfrac{q}{1 + nq}\right)^2.$


We see that the unproductive (or undesirable) genotype gradually
drops out, but the process is extremely slow. For q = 0.1 it takes ten
generations to reduce the frequency of a-genes by one-half; this reduces
the frequency of the aa-type approximately from 1 to 1/4 per cent. (If
a is sex-linked, the elimination proceeds much faster as shown in
problem 29; for a generalized selection scheme see problem 30.)7
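
The slowness of the elimination can be verified by iterating the recursion directly. This is a minimal sketch, assuming the recursion $q_{n+1} = q_n/(1 + q_n)$ of (7.6) and the closed form $q_n = q/(1 + nq)$ of (7.9); the helper name `q_after` is an illustrative choice.

```python
def q_after(n, q):
    """Apply q -> q / (1 + q), the recursion (7.6), n times."""
    for _ in range(n):
        q = q / (1.0 + q)
    return q


q0 = 0.1
q10 = q_after(10, q0)
closed = q0 / (1.0 + 10 * q0)      # closed form (7.9) with n = 10

print(q10, closed)                 # a-gene frequency after ten generations
print(q0 ** 2, q10 ** 2)           # aa-type frequency before and after
```

For $q = 0.1$ both expressions give 0.05 after ten generations, and the aa-frequency $q^2$ falls from 0.01 to 0.0025, that is, from 1 to 1/4 per cent.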


8. PROBLEMS FOR SOLUTION 


1. Three dice are rolled. If no two show the same face, what is the probabil- 
ity that one is an ace? 


2. Given that a throw with ten dice produced at least one ace, what is the 
probability p of two or more aces? 


3. Bridge. In a bridge party West has no ace. What probability should 
be attributed to the event of his partner having (a) no ace, (b) two or more
aces? Verify the result by a direct argument. 

4. Bridge. North and South have ten trumps between them (trumps being 
cards of a specified suit). (a) Find the probability that all three remaining 
trumps are in the same hand (either East or West has no trumps). (b) If it is
known that the king of trumps is included among the three, what is the proba- 
bility that he is “unguarded” (that is, one player has the king, the other the 
remaining two trumps)? 


7 For a further analysis of various eugenic effects (which are frequently different 
from the ideas of enthusiastic proponents of sterilization laws) see G. Dahlberg, 
Mathematical methods for population genetics, New York and Basel, 1948. 




5. Discuss the key problem in example II(7.b) in terms of conditional proba- 
bilities following the pattern of example (2.0). 


6. In a bolt factory machines A, B, C manufacture, respectively, 25, 35, and
40 per cent of the total. Of their output 5, 4, and 2 per cent are defective 
bolts. A bolt is drawn at random from the produce and is found defective. 
What are the probabilities that it was manufactured by machines A, B, C? 


7. Suppose that 5 men out of 100 and 25 women out of 10,000 are color- 
blind. A colorblind person is chosen at random. What is the probability of 
his being male? (Assume males and females to be in equal numbers.) 


8. Seven balls are distributed randomly in seven cells; the probabilities of 
the various arrangements are tabulated in table 1 of chapter II, section 5. 
Using this table, verify that the probability of a cell’s being triply occupied, 
given that exactly two cells are empty, is 1/4 to five decimals. Show that 1/4 is
the correct answer.


9. A die is thrown as long as necessary for an ace to turn up. Assuming 
that the ace does not turn up at the first throw, what is the probability that 
more than three throws will be necessary? 


10. Continuation. Suppose that the number, n, of throws is even. What 
is the probability that n = 2? 

11. Let8 the probability $p_n$ that a family has exactly n children be $\alpha p^n$ when
$n \ge 1$, and $p_0 = 1 - \alpha p(1 + p + p^2 + \ldots)$. Suppose that all sex
distributions of n children have the same probability. Show that for $k \ge 1$ the
probability that a family contains exactly k boys is $2\alpha p^k/(2 - p)^{k+1}$.


12. Continuation. Given that a family includes at least one boy, what is 
the probability that there are two or more? 


13. Die A has four red and two white faces, whereas die B has two red and 
four white faces. A coin is flipped once. If it falls heads, the game continues 
by throwing die A alone; if it falls tails, die B is to be used. (a) Show that the 
probability of red at any throw is 1/2. (b) If the first two throws resulted in red,
what is the probability of red at the third throw? (c) If red turns up at the 
first n throws, what is the probability that die A is being used? (d) To which 
urn model is this game equivalent? 

14. In example (2.a) let $x_n$ be the probability that the winner of the nth
trial wins the entire game; let $y_n$ and $z_n$ be the probabilities of victory for the
losing and the pausing player, respectively, of the nth trial. (a) Show that


(*) $x_n = \tfrac{1}{2} + \tfrac{1}{2}y_{n+1}, \quad y_n = \tfrac{1}{2}z_{n+1}, \quad z_n = \tfrac{1}{2}x_{n+1}.$


(b) Show by a direct simple argument that in reality $x_n = x$, $y_n = y$, $z_n = z$
are independent of n. (c) Conclude that the probability that player a wins
the game is 5/14 (in agreement with problem I, 5). (d) Show that $x_n = \tfrac{4}{7}$,
$y_n = \tfrac{1}{7}$, $z_n = \tfrac{2}{7}$ is the only bounded solution of (*).

15. Let the events $A_1, A_2, \ldots, A_n$ be independent and $P\{A_k\} = p_k$. Find
the probability p that none of the events occurs. 


8 According to A. J. Lotka, American family statistics satisfies our hypothesis 
with p = 0.7358. See Théorie analytique des associations biologiques II, Actualités 
scientifiques et industrielles, no. 780. Paris, 1939. 




16. Continuation. Show that always $p < e^{-\sum p_k}$.

17. Continuation. From Bonferroni's inequality IV(5.7) deduce that the
probability of k or more of the events $A_1, \ldots, A_n$ occurring simultaneously
is less than $(p_1 + \ldots + p_n)^k/k!$.

18. To Polya's urn scheme, example (2.c). Given that the second ball was
black, what is the probability that the first was black?

19. To Polya's urn scheme, example (2.c). Show by induction that the
probability of a black ball at any trial is $b/(b + r)$.

20. Continuation. Prove by induction: for any m < n the probabilities
that the mth and the nth drawings produce black, black or black, red are


$\dfrac{b(b + c)}{(b + r)(b + r + c)}, \qquad \dfrac{br}{(b + r)(b + r + c)},$

respectively. Generalize to more than two drawings. 

21. Time symmetry of Polya’s scheme. Let A and B stand for either black 
or red (so that AB can be any of the four combinations). Show that the proba- 
bility of A at the nth drawing, given that the mth drawing yields B, is the 
same as the probability of A at the mth drawing when the nth drawing yields B. 

22. In the Polya scheme let $p_k(n)$ be the probability of k black balls in the
first n drawings. Prove the recurrence relation


$p_k(n + 1) = p_k(n)\,\dfrac{r + (n - k)c}{b + r + nc} + p_{k-1}(n)\,\dfrac{b + (k - 1)c}{b + r + nc}$


where $p_{-1}(n)$ is to be interpreted as 0. Use this relation for a new proof of
(2.3).
23. The Polya distribution. In (2.4) set


(8.1) $\dfrac{b}{b + r} = p, \quad \dfrac{r}{b + r} = q, \quad \dfrac{c}{b + r} = \gamma.$

Show that

(8.2) $p_{n_1,n} = \dbinom{-p/\gamma}{n_1}\dbinom{-q/\gamma}{n_2} \bigg/ \dbinom{-1/\gamma}{n}, \qquad n_1 + n_2 = n,$


remains meaningful for arbitrary (not necessarily rational) constants p > 0,
q > 0, $\gamma > -1$ such that p + q = 1. Verify that $p_{n_1,n} > 0$ and


$\sum_{\nu=0}^{n} p_{\nu,n} = 1.$


Thus equation (8.2) defines a probability distribution on the integers 0, 1, ...,
n, the Polya distribution.

24. Limiting form of the Polya distribution. If $n \to \infty$, $p \to 0$, $\gamma \to 0$ so
that $np \to \lambda$, $n\gamma \to \rho^{-1}$, then


$p_{n_1,n} \to \dbinom{\lambda\rho + n_1 - 1}{n_1} \left(\dfrac{\rho}{1 + \rho}\right)^{\lambda\rho} \left(\dfrac{1}{1 + \rho}\right)^{n_1}.$

Verify this and show that for fixed $\lambda$, $\rho$ the terms on the right add to unity.
(The right side represents the so-called negative binomial distribution; cf.
chapter VI, section 8, and problem VI, 37.)

25. Interpret equation II(11.8) in terms of conditional probabilities.


Applications in Biology 


26. Under random mating less than half the population belongs to genotype 
Aa. 
27. Generalize the results of section 5 to the case where each gene can have
any of the forms $A_1, A_2, \ldots, A_k$, so that there are $\binom{k+1}{2}$ genotypes instead
of three (multiple alleles).

28. Brother-sister mating. Two parents are selected at random from a popu- 
lation in which the genotypes AA, Aa, aa occur with frequencies u, 2v, w. 
This process is repeated in their progeny. Find the probabilities that both 
parents of the first, second, third filial generation belong to AA [cf. examples 
XV(2.l) and XVI(4.6)]. 

29. Selection. Let a be a recessive sex-linked gene, and suppose that a 
selection process makes mating of a-males impossible. If the genotypes AA, 
Aa, aa appear among females with frequencies u, 2v, w, show that for female 
descendants of the first generation $u_1 = u + v$, $2v_1 = v + w$, $w_1 = 0$ and hence
$p_1 = p + \tfrac{1}{2}q$, $q_1 = \tfrac{1}{2}q$. That is to say, the frequency of the a-gene among
females is reduced to one-half. 

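30. The selection problem of section 7 can be generalized by assuming that
only the fraction $\lambda$ ($0 < \lambda < 1$) of the aa-class is eliminated. Show that


$p = \dfrac{u + v}{1 - \lambda w}, \quad q = \dfrac{v + (1 - \lambda)w}{1 - \lambda w}.$

More generally, (7.3) is to be replaced by


$p_{n+1} = \dfrac{p_n}{1 - \lambda q_n^2}, \quad q_{n+1} = \dfrac{1 - \lambda q_n}{1 - \lambda q_n^2}\,q_n.$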


(The general solution of these equations appears to be unknown.) 

31. Consider simultaneously two pairs of genes with possible forms (A, a)
and (B, b), respectively. Any person transmits to each descendant one gene
of each pair, and we shall suppose that each of the four possible combinations
has probability 1/4. (This is the case if the genes are on separate chromosomes;
otherwise there is strong dependence.) There exist nine genotypes, and we
assume that their frequencies in the parent population are $U_{AABB}$, $U_{aaBB}$,
$U_{AAbb}$, $U_{aabb}$, $2U_{AaBB}$, $2U_{Aabb}$, $2U_{AABb}$, $2U_{aaBb}$, $4U_{AaBb}$. Put

$p_{AB} = U_{AABB} + U_{AABb} + U_{AaBB} + U_{AaBb},$
$p_{Ab} = U_{AAbb} + U_{Aabb} + U_{AABb} + U_{AaBb},$
$p_{aB} = U_{aaBB} + U_{aaBb} + U_{AaBB} + U_{AaBb},$
$p_{ab} = U_{aabb} + U_{aaBb} + U_{Aabb} + U_{AaBb}.$

Compute the corresponding quantities for the first descendant
generation. Show that for it $p^{(1)}_{AB} = p_{AB} - \delta$, $p^{(1)}_{Ab} = p_{Ab} + \delta$,
$p^{(1)}_{aB} = p_{aB} + \delta$, $p^{(1)}_{ab} = p_{ab} - \delta$ with $2\delta = p_{AB}p_{ab} - p_{Ab}p_{aB}$.
The stationary distribution is given by $p_{AB} - 2\delta$, $p_{Ab} + 2\delta$, etc.
(Notice that Hardy's law does not apply; the composition changes from
generation to generation.)

32. Assume that the genotype frequencies in a population are $u = p^2$,
$2v = 2pq$, $w = q^2$. Given that a man is of genotype Aa, the probability that
his brother is of the same genotype is $(1 + pq)/2$.


Note: The following problems are on family relations and give a meaning to the 
notion of degree of relationship. Each problem is a continuation of the preceding 
one. Random mating and the notations of section 5 are assumed. We are here con- 
cerned with a special case of Markov chains (cf. chapter XV). Matrix algebra simplifies 
the writing. 


33. Number the genotypes AA, Aa, aa by 1, 2, 3, respectively, and let
$p_{ik}$ (i, k = 1, 2, 3) be the conditional probability that an offspring is of
genotype k if it is known that the male (or female) parent is of genotype i.
Compute the nine probabilities $p_{ik}$, assuming that the probabilities for the other
parent to be of genotype 1, 2, 3 are $p^2$, $2pq$, $q^2$, respectively.

34. Show that $p_{ik}$ is also the conditional probability that the parent is of
genotype k if it is known that a specified offspring is of genotype i.

35. Prove that the conditional probability of a grandson (grandfather) to
be of genotype k if it is known that the grandfather (grandson) is of genotype
i is given by

$p^{(2)}_{ik} = p_{i1}p_{1k} + p_{i2}p_{2k} + p_{i3}p_{3k}.$


[The matrix $(p^{(2)}_{ik})$ is the square of the matrix $(p_{ik})$.]

36. Show that $p^{(2)}_{ik}$ is also the conditional probability that a man is of
genotype k if it is known that a specified half-brother is of genotype i.

37. Show that the conditional probability of a man to be of genotype k
when it is known that a specified great-grandfather (or great-grandson) is of
genotype i is given by


$p^{(3)}_{ik} = p^{(2)}_{i1}p_{1k} + p^{(2)}_{i2}p_{2k} + p^{(2)}_{i3}p_{3k} = p_{i1}p^{(2)}_{1k} + p_{i2}p^{(2)}_{2k} + p_{i3}p^{(2)}_{3k}.$


(The matrix $(p^{(3)}_{ik})$ is the third power of the matrix $(p_{ik})$. This procedure gives
a precise meaning to the notion of the degree of family relationship.)

38. More generally, define probabilities $p^{(n)}_{ik}$ that a descendant of nth
generation is of genotype k if a specified ancestor was of genotype i. Prove by
induction that the $p^{(n)}_{ik}$ are given by the elements of the following matrix


$$
\begin{pmatrix}
p^2 + pq/2^{n-1} & 2pq + q(q - p)/2^{n-1} & q^2 - q^2/2^{n-1} \\
p^2 + p(q - p)/2^{n} & 2pq + (1 - 4pq)/2^{n} & q^2 + q(p - q)/2^{n} \\
p^2 - p^2/2^{n-1} & 2pq + p(p - q)/2^{n-1} & q^2 + pq/2^{n-1}
\end{pmatrix}.
$$

(This shows that the influence of an ancestor decreases from generation to
generation by the factor 1/2.)


9 The first edition contained an error since the word brother (two common parents)
was used where a half-brother was meant. This error is pointed out and the
correct formulas are given in C. C. Li and Louis Sacks, The derivation of the joint
distribution and correlation between relatives by the use of stochastic matrices,
Biometrika, vol. 40 (1954), pp. 347-360.

39. Consider the problem 36 for a full brother instead of a half-brother.
Show that the corresponding matrix is


$$
\begin{pmatrix}
\tfrac{1}{4}(1 + p)^2 & \tfrac{1}{2}q(1 + p) & \tfrac{1}{4}q^2 \\
\tfrac{1}{4}p(1 + p) & \tfrac{1}{2}(1 + pq) & \tfrac{1}{4}q(1 + q) \\
\tfrac{1}{4}p^2 & \tfrac{1}{2}p(1 + q) & \tfrac{1}{4}(1 + q)^2
\end{pmatrix}.
$$

40. Show that the degree of relationship between uncle and nephew is the 
same as between grandfather and grandson. 


CHAPTER VI 


The Binomial 
and the Poisson Distributions 


1. BERNOULLI TRIALS1


Repeated independent trials are called Bernoulli trials if there are only 
two possible outcomes for each trial and their probabilities remain the same 
throughout the trials. It is usual to denote the two probabilities by p
and q, and to refer to the outcome with probability p as "success," S,
and to the other as "failure," F. Clearly, p and q must be non-negative,
and


(1.1) $p + q = 1.$


The sample space of each individual trial is formed by the two points
S and F. The sample space of n Bernoulli trials contains $2^n$ points or
successions of n symbols S and F, each point representing one possible
outcome of the compound experiment. Since the trials are independent,
the probabilities multiply. In other words, the probability of any
specified sequence is the product obtained on replacing the symbols S and F
by p and q, respectively. Thus $P\{(SSFSF \ldots FFS)\} = ppqpq \cdots qqp$.
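
The product rule is easy to illustrate numerically. The following is a small sketch; the function name `sequence_probability` and the chosen values of n and p are illustrative assumptions, not part of the text.

```python
from itertools import product


def sequence_probability(outcomes, p):
    """Probability of one specified sequence of 'S' and 'F' symbols."""
    q = 1.0 - p
    prob = 1.0
    for symbol in outcomes:
        prob *= p if symbol == "S" else q   # replace S by p and F by q
    return prob


# Illustrative values only: a symmetric coin and a sequence of five trials.
fair = sequence_probability("SSFSF", 0.5)

# The 2**n points of the sample space must carry total probability one.
n, p = 4, 0.3
total = sum(sequence_probability(seq, p) for seq in product("SF", repeat=n))

print(fair, total)
```

For the symmetric coin every sequence of five trials has probability $2^{-5}$, and for any p the probabilities of all $2^n$ sequences add to unity.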


Examples. The most familiar example of Bernoulli trials is provided
by successive tosses of a true or symmetric coin; here $p = q = \tfrac{1}{2}$.
If the coin is unbalanced, we still assume that the successive tosses are
independent so that we have a model of Bernoulli trials in which the
probability p for success can have an arbitrary value. Repeated random
drawings from an urn containing at each drawing r red and b black
balls represent Bernoulli trials with $p = r/(r + b)$. Often we have no
interest in distinguishing among several outcomes and prefer to
describe any result simply as A or non-A. Thus with good dice the
distinction between ace (S) and non-ace (F) leads to Bernoulli trials with
$p = \tfrac{1}{6}$, whereas distinguishing between even or odd leads to Bernoulli
trials with $p = \tfrac{1}{2}$. If the die is unbalanced, the successive throws still
form Bernoulli trials, but the corresponding probabilities p are
different. Royal flush in poker or double ace in rolling dice may represent
success; calling all other outcomes failure, we have Bernoulli trials with
$p = 1/649{,}740$ and $p = \tfrac{1}{36}$, respectively. Reductions of this type are
usual in statistical applications. For example, washers produced in
mass production may vary in thickness, but, on inspection, they are
classified as conforming (S) or defective (F) according as their
thickness is, or is not, within prescribed limits.


1 James Bernoulli (1654-1705). His main work, the Ars conjectandi, was
published in 1713.


The Bernoulli scheme of trials is a theoretical model, and only ex- 
perience can show whether it is suitable for the description of specified 
experiments. Our knowledge that successive tossings of a coin con- 
form to the Bernoulli scheme is derived from experimental evidence. 
The man in the street, and also the philosopher K. Marbe,2 believes
that after a run of seventeen heads tail becomes more probable. This 
argument has nothing to do with imperfections of physical coins; it 
endows nature with memory, or, in our terminology, it denies the 
stochastic independence of successive trials. Marbe’s theory cannot 
be refuted by logic but is rejected because of lack of empirical support. 

In sampling practice, industrial quality control, etc., the scheme of 
Bernoulli trials provides an ideal standard even though it can never 
be fully attained. Thus, in the example above of the production of 
washers, there are many reasons why the output cannot conform to 
the Bernoulli scheme. The machines are subject to changes, and hence 
the probabilities do not remain constant; there is a persistence in the 
action of machines, and therefore long runs of deviations of like kind 
are more probable than they would be if the trials were truly independ- 
ent. From the point of view of quality control, however, it is desirable 
that the process conform to the Bernoulli scheme, and it is an important 
discovery that, within certain limits, production can be made to behave 
in this way. The purpose of continuous control is then to discover at 
an early stage flagrant departures from the ideal scheme and to use 
them as an indication of impending trouble. 


2. THE BINOMIAL DISTRIBUTION 


Frequently we are interested only in the total number of successes
produced in a succession of n Bernoulli trials but not in their order.
The number of successes can be 0, 1, ..., n, and our first problem is
to determine the corresponding probabilities. Now the event "n trials
result in k successes and n − k failures" can happen in as many ways
as k letters S can be distributed among n places. In other words, our
event contains $\binom{n}{k}$ points, and, by definition, each point has the
probability $p^k q^{n-k}$. This proves the


2 Die Gleichförmigkeit in der Welt, Munich, 1916. There exists a huge critical
literature on Marbe's theory.


Theorem. Let b(k; n, p) be the probability that n Bernoulli trials with
probabilities p for success and q = 1 − p for failure result in k successes
and n − k failures ($0 \le k \le n$). Then


(2.1) $b(k; n, p) = \dbinom{n}{k} p^k q^{n-k}.$


In particular, the probability of no success is $q^n$, and the probability
of at least one success is $1 - q^n$.
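
Formula (2.1) translates directly into code. The sketch below uses `math.comb` from the Python standard library for the binomial coefficient; the function name `b` mirrors the notation of the theorem, and the values n = 12, p = 1/3 are chosen only for illustration.

```python
from math import comb


def b(k, n, p):
    """Binomial probability b(k; n, p) of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


n, p = 12, 1.0 / 3.0                             # illustrative values
terms = [b(k, n, p) for k in range(n + 1)]

print(sum(terms))                                # total probability
print(terms[0], (1.0 - p) ** n)                  # no success: q**n
print(1.0 - terms[0])                            # at least one success
```

The terms add to unity up to rounding, and the term for k = 0 agrees with $q^n$.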


We shall treat p as a constant and denote the number of successes
in n trials by $S_n$; then $b(k; n, p) = P\{S_n = k\}$. In the general
terminology $S_n$ is a random variable, and the function (2.1) is the
"distribution" of this random variable; we shall refer to it as the binomial
distribution. The attribute "binomial" refers to the fact that (2.1)
represents the kth term of the binomial expansion of $(q + p)^n$. This
remark shows also that $b(0; n, p) + b(1; n, p) + \ldots + b(n; n, p) =
(q + p)^n = 1$, as is required by the notion of probability. The
binomial distribution has been tabulated.3


Examples. (a) Weldon's dice data. Let an experiment consist in
throwing twelve dice and let us count fives and sixes as "success." If
the dice are perfect, the probability of success is $p = \tfrac{1}{3}$ and the number
of successes should follow the binomial distribution $b(k; 12, \tfrac{1}{3})$. Table
1 gives these probabilities, together with the corresponding observed
average frequencies in 26,306 actual experiments. The agreement looks
good, but for such extensive data it is really very bad. Statisticians
usually judge closeness of fit by the chi-square criterion. According
to it, deviations as large as those observed would happen with true dice
only once in 10,000 times. It is, therefore, reasonable to assume that
the dice were biased. A bias with probability of success p = 0.3377
would fit the observations.


3 For n < 50, see National Bureau of Standards, Tables of the binomial
probability distribution, Applied Mathematics Series, vol. 6 (1950). For 50 < n < 100, see
H. C. Romig, 50-100 Binomial tables, John Wiley and Sons, 1953. For a wider
range see Tables of the cumulative binomial probability distribution, by the Harvard
Computation Laboratory, 1955, and Tables of the cumulative binomial probabilities,
by the Ordnance Corps, ORDP 20-11 (1952).


TABLE 1


WELDON'S DICE DATA


                         Observed
 k    b(k; 12, 1/3)     Frequency     b(k; 12, 0.3377)
 0     0.007 707        0.007 033        0.007 123
 1      .046 244         .043 678         .043 584
 2      .127 171         .124 116         .122 225
 3      .211 952         .208 127         .207 736
 4      .238 446         .232 418         .238 324
 5      .190 757         .197 445         .194 429
 6      .111 275         .116 589         .115 660
 7      .047 689         .050 597         .050 549
 8      .014 903         .015 320         .016 109
 9      .003 312         .003 991         .003 650
10      .000 497         .000 532         .000 558
11      .000 045         .000 152         .000 052
12      .000 002         .000 000         .000 002
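
The two theoretical columns of Table 1 can be recomputed from (2.1). This is a minimal sketch that prints them to six decimals; the observed frequencies are experimental data and are not reproduced by the code, and the function name `b` is an illustrative choice.

```python
from math import comb


def b(k, n, p):
    """Binomial probability b(k; n, p)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


perfect = [b(k, 12, 1.0 / 3.0) for k in range(13)]   # perfect dice
biased = [b(k, 12, 0.3377) for k in range(13)]       # fitted bias

for k in range(13):
    print(f"{k:2d}  {perfect[k]:.6f}  {biased[k]:.6f}")
```

Comparing the printed columns with the table is a quick way to detect transcription errors in the tabulated values.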


(b) In chapter IV, section 4, we have encountered the binomial
distribution in connection with a card-guessing problem, and the columns
$b_m$ of table 3 exhibit the terms of the distribution for n = 3, 4, 5, 6, 10
and p = 1/n. In the occupancy problem II(4.c) we found formula
II(4.5), which is another special case of the binomial distribution.

(c) If the probability of success is 0.01, how many trials are
necessary in order for the probability of at least one success to be 1/2 or more?
Here we seek the smallest integer n for which $1 - (0.99)^n \ge \tfrac{1}{2}$, or
$-n \log (0.99) \ge \log 2$. Since $\log 2 / (-\log 0.99) \approx 68.97$, this requires $n \ge 69$.
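
The threshold in example (c) can be checked by direct search. The sketch below looks for the smallest n with $1 - (0.99)^n \ge \tfrac{1}{2}$ and compares it with the bound $\log 2 / (-\log 0.99)$ obtained from the logarithmic form of the inequality; the variable names are illustrative.

```python
from math import ceil, log

p = 0.01                      # probability of success in a single trial

# Direct search for the smallest n with P{at least one success} >= 1/2.
n = 1
while 1.0 - (1.0 - p) ** n < 0.5:
    n += 1

# The same threshold from -n log(0.99) >= log 2.
bound = log(2.0) / -log(1.0 - p)

print(n, bound, ceil(bound))
```

Direct evaluation gives a bound of about 68.97, so the search stops at n = 69.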

4 R. A. Fisher, Statistical methods for research workers, Edinburgh-London, 1932,
p. 66, or T. C. Fry, Probability and its engineering uses, New York, 1928, pp. 303ff.

(d) A power supply problem. Suppose that n = 10 workers are to
use electric power intermittently, and we are interested in estimating
the total load to be expected. For a crude approximation imagine that
at any given time each worker has the same probability p of requiring
a unit of power. If they work independently, the probability of exactly
k workers requiring power at the same time should be b(k; n, p). If,
on the average, a worker uses power for 12 minutes per hour, we would
put p = 1/5. The probability of seven or more workers requiring
current at the same time is then b(7; 10, 0.2) + ... + b(10; 10, 0.2) =
0.0008643584. In other words, if the supply is adjusted to six power
units, an overload has probability 0.00086... and should be expected
for about one minute in 1157, that is, about one minute in twenty
hours. The probability of eight or more workers requiring current
at the same time is only 0.0000779264 or about eleven times less.
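The overload probabilities of example (d) are plain binomial tail sums and can be recomputed in a few lines. This is an illustrative sketch only; the names `b` and `upper_tail` are chosen here to mirror the text's notation and are not from the original.

```python
from math import comb

def b(k: int, n: int, p: float) -> float:
    """Binomial term b(k; n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_tail(r: int, n: int, p: float) -> float:
    """Probability of at least r successes in n Bernoulli trials."""
    return sum(b(k, n, p) for k in range(r, n + 1))

overload_six_units = upper_tail(7, 10, 0.2)    # supply of six power units
overload_seven_units = upper_tail(8, 10, 0.2)  # supply of seven power units

print(overload_six_units, overload_seven_units)
print(1 / overload_six_units)                     # minutes between overloads, on average
print(overload_six_units / overload_seven_units)  # the "about eleven times" ratio
```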

(e) Testing sera or vaccines.⁵ Suppose that the normal rate of infection
of a certain disease in cattle is 25 per cent. To test a newly
discovered serum n healthy animals are injected with it. How are we to
evaluate the result of the experiment? If the serum is absolutely
worthless, the probability that exactly k of the n test animals remain
free from infection may be equated to b(k; n, 0.75). For k = n = 10
this probability is about 0.056, and for k = n = 12 only 0.032. Thus, 
if out of ten or twelve test animals none catches infection, this may be 
taken as an indication that the serum has had an effect, although it is 
not a conclusive proof. Note that, without serum, the probability that 
out of seventeen animals at most one catches infection is about 0.0501. 
It is therefore stronger evidence in favor of the serum if out of seventeen 
test animals only one gets infected than if out of ten all remain healthy. 
For n = 23 the probability of at most two animals catching infection 
is about 0.0492, and thus two failures out of twenty-three is again 
better evidence for the serum than one out of seventeen or none out 
of ten. 
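The four probabilities quoted in example (e) can be reproduced under the stated null hypothesis (a worthless serum, infection rate 25 per cent). The sketch below is illustrative; the function name `at_most_infected` is an assumption introduced here.

```python
from math import comb

def b(k: int, n: int, p: float) -> float:
    """Binomial term b(k; n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def at_most_infected(j: int, n: int, rate: float = 0.25) -> float:
    """P(at most j of n animals catch infection) when the serum has no effect."""
    return sum(b(i, n, rate) for i in range(j + 1))

# (allowed infections, number of test animals) for the cases in the text
for j, n in [(0, 10), (0, 12), (1, 17), (2, 23)]:
    print(f"n = {n:2d}, at most {j} infected: {at_most_infected(j, n):.4f}")
```

A smaller probability under the null hypothesis means the outcome is stronger evidence for the serum, which is the ordering the text describes.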

(f) Another statistical test. Suppose n people have their blood pressure
measured with and without a certain drug. Let the observations
be $x_1, \ldots, x_n$ and $x'_1, \ldots, x'_n$. We say that the ith trial
resulted in success if $x_i < x'_i$, and in failure if $x_i > x'_i$. (For
simplicity we may assume that no two measurements lead to exactly the
same result.) If the drug has no effect, then our observation should
correspond to n Bernoulli trials with p = 1/2, and an excessive number of
successes is to be taken as evidence that the drug has an effect.


3. THE CENTRAL TERM AND THE TAILS 
From (2.1) we see that

(3.1)    $$\frac{b(k; n, p)}{b(k-1; n, p)} = \frac{(n-k+1)p}{kq} = 1 + \frac{(n+1)p - k}{kq}.$$

Accordingly, the term b(k; n, p) is greater than the preceding one for
k < (n+1)p and is smaller for k > (n+1)p. If (n+1)p = m happens
to be an integer, then b(m; n, p) = b(m−1; n, p). There exists
exactly one integer m such that

(3.2)    $$(n+1)p - 1 < m \le (n+1)p,$$

and we have the


Theorem 1. As k goes from 0 to n, the terms b(k; n, p) first increase
monotonically, then decrease monotonically, reaching their greatest value
when k = m, except that b(m−1; n, p) = b(m; n, p) when m = (n+1)p.


5 P. V. Sukhatme and V. G. Panse, Size of experiments for testing sera or
vaccines, Indian Journal of Veterinary Science and Animal Husbandry, vol. 13 (1948),
pp. 75-82.


We shall call b(m; n, p) the central term. Often m is called “the
most probable number of successes,” but it must be understood that
for large values of n all terms b(k; n, p) are small. In 100 tossings of
a true coin the most probable number of heads is 50, but its probability
is less than 0.08. In the next chapter we shall find that b(m; n, p) is
approximately $(2\pi npq)^{-1/2}$.
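The coin-tossing remark can be verified numerically. The sketch below locates the central term through (3.2) and compares it with the approximation $(2\pi npq)^{-1/2}$ quoted from the next chapter. The helper name `central_index` is an assumption of this example, and floating-point rounding of (n+1)p is ignored.

```python
from math import comb, floor, pi, sqrt

def b(k: int, n: int, p: float) -> float:
    """Binomial term b(k; n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def central_index(n: int, p: float) -> int:
    """The integer m with (n+1)p - 1 < m <= (n+1)p, as in (3.2)."""
    return floor((n + 1) * p)

n, p = 100, 0.5
m = central_index(n, p)
central = b(m, n, p)
approx = 1 / sqrt(2 * pi * n * p * (1 - p))

print(m, central, approx)
```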

It is obvious that the ratio in formula (3.1) decreases monotonically
as k increases; thus, when k ≥ r + 1

(3.3)    $$\frac{b(k; n, p)}{b(k-1; n, p)} \le \frac{(n-r)p}{(r+1)q}.$$

Set herein k = r+1, ..., r+ν and multiply the ν inequalities to obtain

(3.4)    $$\frac{b(r+\nu; n, p)}{b(r; n, p)} \le \left\{\frac{(n-r)p}{(r+1)q}\right\}^{\nu}.$$

For r ≥ np the fraction within braces is less than unity, and summation
over ν leads to a finite geometric series with ratio (n − r)p/(r + 1)q.
We conclude that for r ≥ np

(3.5)    $$\sum_{\nu=0}^{\infty} b(r+\nu; n, p) \le b(r; n, p)\,\frac{(r+1)q}{r+1-(n+1)p}.$$

On the left we have the right “tail” of the binomial distribution, namely
the probability of at least r successes. The same calculation applied
to the left tail shows that for s ≤ np

(3.6)    $$\sum_{\rho=0}^{s} b(\rho; n, p) \le b(s; n, p)\,\frac{(n-s+1)p}{(n+1)p-s}.$$


We have proved

Theorem 2. If r ≥ np, the probability of at least r successes satisfies
the inequality (3.5); if s ≤ np, the probability of at most s successes
satisfies the inequality (3.6).


[For an alternative proof see problem 39(a).] 
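A numerical comparison shows how the bounds of Theorem 2 behave. The sketch below evaluates the exact tails and the right-hand sides of (3.5) and (3.6), written as b(r; n, p)(r+1)q / (r+1−(n+1)p) and b(s; n, p)(n−s+1)p / ((n+1)p−s). The values n = 100 and p = 0.3 are arbitrary illustrative choices, and the function names are assumptions of this example.

```python
from math import comb

def b(k: int, n: int, p: float) -> float:
    """Binomial term b(k; n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def right_tail_bound(r: int, n: int, p: float) -> float:
    """Right side of (3.5); meaningful for r >= n*p."""
    q = 1 - p
    return b(r, n, p) * (r + 1) * q / (r + 1 - (n + 1) * p)

def left_tail_bound(s: int, n: int, p: float) -> float:
    """Right side of (3.6); meaningful for s <= n*p."""
    return b(s, n, p) * (n - s + 1) * p / ((n + 1) * p - s)

n, p = 100, 0.3
for r in (30, 40, 50):
    exact = sum(b(k, n, p) for k in range(r, n + 1))
    print(f"r = {r}: exact tail {exact:.3e}, bound {right_tail_bound(r, n, p):.3e}")
for s in (10, 20, 30):
    exact = sum(b(k, n, p) for k in range(s + 1))
    print(f"s = {s}: exact tail {exact:.3e}, bound {left_tail_bound(s, n, p):.3e}")
```

The bound is crude near np, where the geometric ratio is close to one, and becomes sharp far out in the tails.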


4. THE LAW OF LARGE NUMBERS 


On several occasions we have mentioned that our intuitive notion of
probability is based on the following assumption. If in n identical trials
A occurs ν times, and if n is very large, then ν/n should be near the
probability p of A. Clearly, a formal mathematical theory can never
refer directly to real life, but it should at least provide theoretical
counterparts to the phenomena which it tries to explain. Accordingly,
we require that the vague introductory remark be made precise in the
form of a theorem. For this purpose we translate “identical trials”
as “Bernoulli trials” with probability p for success. If $S_n$ is the
number of successes in n trials, then $S_n/n$ is the average number of
successes and should be near p. It is now easy to give a precise meaning
to this. Consider, for example, the probability that $S_n/n$ exceeds
p + ε, where ε > 0 is arbitrarily small but fixed. This probability is
the same as $P\{S_n > n(p + \epsilon)\}$ and equals the left side of (3.5) when
r is the smallest integer exceeding n(p + ε). Then (3.5) implies

(4.1)    $$P\{S_n > n(p + \epsilon)\} < b(r; n, p)\,\frac{np + n\epsilon + q}{n\epsilon + q}.$$

With increasing n the fraction on the right remains bounded, whereas
b(r; n, p) → 0 since b(r; n, p) < b(k; n, p) for each k such that
(n+1)p < k < r, and there are about nε such terms b(k; n, p). It
follows that as n increases, $P\{S_n > n(p + \epsilon)\} \to 0$. Using formula
(3.6), we see in the same way that $P\{S_n < n(p - \epsilon)\} \to 0$, and we
have thus

(4.2)    $$P\left\{\left|\frac{S_n}{n} - p\right| < \epsilon\right\} \to 1.$$


In words: As n increases, the probability that the average number of 
successes deviates from p by more than any preassigned ε tends to zero. 
This is one form of the law of large numbers and serves as a basis for 
the intuitive notion of probability as a measure of relative frequencies. 
For practical applications it must be supplemented by a more precise 
estimate of the probability on the left side in (4.2); such an estimate 
is provided by the normal approximation to the binomial distribution 




[cf. the typical example VII(8.g)]. Actually formula (4.2) is a simple 
consequence of the latter (problem VII, 18). 

The assertion (4.2) is the classical law of large numbers. It is of very 
limited interest and should be replaced by the more precise and more 
useful strong law of large numbers (see chapter VIII, section 4). 
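The convergence asserted in (4.2) can be watched on exact binomial sums. The sketch below computes $P\{|S_n/n - p| < \epsilon\}$ for a true coin and a few values of n. The values p = 1/2 and ε = 1/20 are illustrative choices, and exact fractions are used in the comparison so that boundary cases such as k/n − p = ε are not misclassified by rounding.

```python
from fractions import Fraction
from math import comb

def prob_within(n: int, p: Fraction, eps: Fraction) -> float:
    """Exact-sum value of P{|S_n/n - p| < eps} for n Bernoulli trials."""
    pf = float(p)
    return sum(
        comb(n, k) * pf**k * (1 - pf) ** (n - k)
        for k in range(n + 1)
        if abs(Fraction(k, n) - p) < eps   # exact rational comparison
    )

p, eps = Fraction(1, 2), Fraction(1, 20)
values = [prob_within(n, p, eps) for n in (10, 100, 1000)]
for n, v in zip((10, 100, 1000), values):
    print(n, round(v, 4))
```

For n = 10 only k = 5 qualifies, so the probability is just b(5; 10, 1/2); by n = 1000 it is already very close to one.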

Warning. It is usual to read into the law of large numbers things 
which it definitely does not imply. If Peter and Paul toss a perfect 
coin 10,000 times, it is customary to expect that Peter will be in the 
lead roughly half the time. This is not true. The arc sine law (chapter
III, section 5) states that such an equalization is least probable. The
probability that Peter leads in less than 20 trials is very much larger than
the probability that the number of trials in which he leads lies between
4990 and 5010. There does not exist any tendency for the periods of
lead to equalize. The law of large numbers asserts only that in a large
number of different coin-tossing games the frequency of those in which
heads lead is, at any given moment, close to 1/2. Nothing is said about
the fluctuations of the lead within a fixed game. 


5. THE POISSON APPROXIMATION⁶


In many applications we deal with Bernoulli trials where, compara- 
tively speaking, n is large and p is small, whereas the product 


(5.1)    $$\lambda = np$$

is of moderate magnitude. In such cases it is convenient to use an
approximation formula to b(k; n, p) which is due to Poisson and which
we proceed to derive. We have $b(0; n, p) = (1 - p)^n$ or, substituting
from (5.1),

(5.2)    $$b(0; n, p) = \left(1 - \frac{\lambda}{n}\right)^{n}.$$

Passing to logarithms and using the Taylor expansion II(8.10), we find

(5.3)    $$\log b(0; n, p) = n \log\left(1 - \frac{\lambda}{n}\right) = -\lambda - \frac{\lambda^2}{2n} - \ldots$$

so that for large n

(5.4)    $$b(0; n, p) \approx e^{-\lambda},$$

where the sign ≈ is used to indicate approximate equality (in the
present case up to terms of order of magnitude $n^{-1}$). Furthermore, from
(3.1) it is seen that for any fixed k and sufficiently large n we have

(5.5)    $$\frac{b(k; n, p)}{b(k-1; n, p)} = \frac{\lambda - (k-1)p}{kq} \approx \frac{\lambda}{k}.$$

For k = 1 we get from this and (5.4) $b(1; n, p) \approx \lambda e^{-\lambda}$. For k = 2
we get $b(2; n, p) \approx \lambda^2 e^{-\lambda}/2$. Generally we see by induction that

(5.6)    $$b(k; n, p) \approx \frac{\lambda^k}{k!}\, e^{-\lambda}.$$

This is the famous Poisson approximation to the binomial distribution.
(See problems 30-34 for an estimate of the error and a proof that the
approximation in (5.6) is uniform when n → ∞ and p → 0 in such a
way that λ = np remains bounded.) It is convenient to have a symbol
for the right-hand member in (5.6), and we shall put

(5.7)    $$p(k; \lambda) = e^{-\lambda}\,\frac{\lambda^k}{k!}.$$

With this notation p(k; λ) should be an approximation to b(k; n, λ/n)
when n is sufficiently large.


6 Siméon D. Poisson (1781-1840). His book, Recherches sur la probabilité des
jugements en matière criminelle et en matière civile, précédées des règles générales du
calcul des probabilités, appeared in 1837.


Examples. (a) The entries $p_m$ of the last column of table 3 in
chapter IV give the values p(m; 1). In the preceding columns $b_m$
stands for b(m; N, 1/N). The table enables us to compare the Poisson
distribution p(m; 1) with the binomial distributions with p = 1/n and
n = 3, 4, 5, 6, 10. It will be seen that the agreement is surprisingly
good despite the small values of n.

(b) Table 2 compares p(k; 1) to the binomial distribution with
n = 100, p = 1/100. It shows the approximation to be satisfactory for
many purposes. As an example take the occurrence of the combination
(7, 7) among 100 pairs of random digits, which should have the
binomial distribution b(k; 100, 1/100). The last column of table 2⁷ gives
the actual counts in 100 batches of 100 pairs of random digits each. To
obtain relative frequencies all entries of the last column should be
divided by 100. These frequencies agree reasonably with the theoretical
probabilities. (As judged by the χ²-criterion, chance fluctuations should, in
about 75 out of 100 similar cases, produce larger deviations of observed
frequencies from the theoretical probabilities.)


TABLE 2

AN EXAMPLE OF THE POISSON APPROXIMATION

  k     b(k; 100, 1/100)     p(k; 1)       N_k
  0        0.366 032        0.367 879       41
  1         .369 730         .367 879       34
  2         .184 865         .183 940       16
  3         .060 999         .061 313        8
  4         .014 942         .015 328        0
  5         .002 898         .003 066        1
  6         .000 463         .000 511        0
  7         .000 063         .000 073        0
  8         .000 007         .000 009        0
  9         .000 001         .000 001        0

The first columns illustrate the Poisson approximation to the binomial
distribution. The last column records the number of batches of 100 pairs of
random digits each in which the combination (7, 7) appears exactly k times.
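The two theoretical columns of table 2 follow directly from the definitions of b(k; n, p) and p(k; λ). A minimal sketch, with function names chosen here for illustration:

```python
from math import comb, exp, factorial

def binom(k: int, n: int, p: float) -> float:
    """Binomial term b(k; n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson(k: int, lam: float) -> float:
    """Poisson term p(k; lambda) = exp(-lambda) * lambda**k / k!."""
    return exp(-lam) * lam**k / factorial(k)

print(" k   b(k;100,1/100)    p(k;1)")
for k in range(10):
    print(f"{k:2d}     {binom(k, 100, 0.01):.6f}     {poisson(k, 1.0):.6f}")
```

The largest discrepancy occurs at k = 0 and k = 1 and is a little under 0.002.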

(c) Birthdays. What is the probability, $p_k$, that in a company of
500 people exactly k will have birthdays on New Year's Day? If the
500 people are chosen at random, we may apply the scheme of 500
Bernoulli trials with probability of success p = 1/365. Then
$p_0 = (364/365)^{500} = 0.2537\ldots$. For the Poisson approximation we
put λ = 500/365 = 1.3699.... Then p(0; λ) = 0.2541, which
involves an error only in the fourth decimal place. For k = 1, 2, ...
the correct values of $p_k$ as calculated from the binomial formula are
$p_1 = 0.3484\ldots$, $p_2 = 0.2388\ldots$, $p_3 = 0.1089\ldots$, $p_4 = 0.0372\ldots$,
$p_5 = 0.0101\ldots$, $p_6 = 0.0023\ldots$. The corresponding Poisson
approximations are p(1; λ) = 0.3481..., p(2; λ) = 0.2385...,
p(3; λ) = 0.1089..., p(4; λ) = 0.0373..., p(5; λ) = 0.0102...,
p(6; λ) = 0.0023.... All errors are in the fourth decimal place.
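The birthday figures of example (c) can be recomputed side by side. The sketch below is illustrative only; the list names `exact` and `approx` are introduced here.

```python
from math import comb, exp, factorial

n, p = 500, 1 / 365
lam = n * p  # Poisson parameter, 500/365

# Binomial probabilities p_k and Poisson approximations p(k; lambda), k = 0..6
exact = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(7)]
approx = [exp(-lam) * lam**k / factorial(k) for k in range(7)]

for k in range(7):
    print(f"k = {k}: binomial {exact[k]:.4f}, Poisson {approx[k]:.4f}")
```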

7 M. G. Kendall and Babington Smith, Tables of random sampling numbers,
Tracts for Computers No. 24, Cambridge, 1940.

(d) Defective items. Suppose that screws are produced under
statistical quality control so that it is legitimate to apply the Bernoulli
scheme of trials. If the probability of a screw's being defective is
p = 0.015, then the probability that a box of 100 screws does not
contain a defective one is $(0.985)^{100} = 0.22061$. The corresponding
Poisson approximation is $e^{-1.5} = 0.22313\ldots$, which should be close
enough for most practical purposes. We now ask: How many screws
should a box contain in order that the probability of finding at least
100 conforming screws be 0.8 or better? If 100 + x is the required
number, then x is a small integer. To apply the Poisson approximation
for n = 100 + x trials we should put λ = np, but np is approximately
100p = 1.5. We then require the smallest integer x for which

(5.8)    $$e^{-1.5}\left\{1 + \frac{1.5}{1} + \ldots + \frac{(1.5)^x}{x!}\right\} \ge 0.8.$$

In tables⁸ we find that for x = 1 the left side is approximately 0.56,
and for x = 2 it is 0.809. Thus the Poisson approximation would
lead to the conclusion that 102 screws are required. Since 0.809 is
dangerously near the given threshold of 0.8, the number 103 is safer.
Actually the probability of finding at least 100 conforming screws in a
box of 102 is

$$(0.985)^{102} + \binom{102}{1}(0.985)^{101}(0.015) + \binom{102}{2}(0.985)^{100}(0.015)^2 = 0.8022\ldots$$
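Both the table lookup for (5.8) and the exact binomial check can be replaced by direct computation. A minimal sketch follows; the function names and the keyword defaults are assumptions made for this illustration.

```python
from math import comb, exp, factorial

def poisson_cdf(x: int, lam: float) -> float:
    """Left side of (5.8): exp(-lam) * sum of lam**k / k! for k = 0..x."""
    return sum(exp(-lam) * lam**k / factorial(k) for k in range(x + 1))

def exact_prob(extra: int, p: float = 0.015, needed: int = 100) -> float:
    """P(at least `needed` conforming screws in a box of needed + extra)."""
    n = needed + extra
    # At most `extra` defective screws are allowed.
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(extra + 1))

# Smallest x satisfying the Poisson criterion (5.8)
x = 0
while poisson_cdf(x, 1.5) < 0.8:
    x += 1

print(x, poisson_cdf(1, 1.5), poisson_cdf(2, 1.5))
print(exact_prob(2), exact_prob(3))
```

The exact value for a box of 102 is only marginally above 0.8, which is why the text recommends one more screw as a safety margin.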


(e) Centenarians. At birth any particular person has a small chance 
of living 100 years, and in a large community the number of yearly 
births is large. Owing to wars, epidemics, etc., different lives are not 
stochastically independent, but as a first approximation we may compare 
n births to n Bernoulli trials with death after 100 years as success. In 
a stable community, where neither size nor mortality rate changes 
appreciably, it is reasonable to expect that the frequency of years in
which exactly k centenarians die is approximately p(k; λ), with λ
depending on the size and health of the community. Records of
Switzerland confirm this conclusion.⁹

(f) Misprints, raisins, etc. If in printing a book there is a constant 
probability of any letter’s being misprinted, and if the conditions of 
printing remain unchanged, then we have as many Bernoulli trials as 
there are letters. The frequency of pages containing exactly k
misprints will then be approximately p(k; λ), where λ is a characteristic of
the printer. Occasional fatigue of the printer, difficult passages, etc., 
will increase the chances of errors and may produce clusters of mis- 
prints. Thus the Poisson formula may be used to discover radical 
departures from uniformity or from the state of statistical control. 
A similar argument applies in many cases. For example, if many 
raisins are distributed in the dough, we should expect that thorough 
mixing will result in the frequency of loaves with exactly k raisins to
be approximately p(k; λ) with λ a measure of the density of raisins in
the dough.


8 E. C. Molina, Poisson's exponential binomial limit, New York, 1942. (These
are tables giving p(k; λ) and p(k; λ) + p(k+1; λ) + ... for k ranging from 0 to
100.)

9 E. J. Gumbel, Les centenaires, Aktuárské Vědy, Prague, vol. 7 (1937), pp. 1-8.


6. THE POISSON DISTRIBUTION


In the preceding section we have used the Poisson expression (5.7) 
merely as a convenient approximation to the binomial distribution in 
the case of large n and small p. In connection with the matching 
and occupancy problems of chapter IV we have studied different 
probability distributions, which have also led to the Poisson
expressions p(k; λ) as a limiting form. We have here a special case of the
remarkable fact that there exist a few distributions of great universality
which occur in a surprisingly great variety of problems. The three
principal distributions, with ramifications throughout probability
theory, are the binomial distribution, the normal distribution (to be
introduced in the following chapter), and the Poisson distribution

(6.1)    $$p(k; \lambda) = e^{-\lambda}\,\frac{\lambda^k}{k!},$$

which we shall now consider on its own merits.

We note first that on adding the equations (6.1) for k = 0, 1, 2, ...
we get on the right side $e^{-\lambda}$ times the Taylor series for $e^{\lambda}$. Hence for
any fixed λ the quantities p(k; λ) add to unity, and therefore it is
possible to conceive of an ideal experiment in which p(k; λ) is the
probability of exactly k successes. We shall now indicate why many physical
experiments and statistical observations actually lead to such an inter- 
pretation of (6.1). The examples of the next section will illustrate the 
wide range and the importance of various applications of (6.1). The 
true nature of the Poisson distribution will become apparent only in 
connection with the theory of stochastic processes (cf. chapter XVII, 
where a new approach to the Poisson distribution is given). 

Consider a sequence of random events occurring in time, such as 
radioactive disintegrations or incoming calls at a telephone exchange. 
Each event is represented by a point on the time axis, and we are 
concerned with chance distributions of points. There exist many dif- 
ferent types of such distributions, but their study belongs to the domain 
of continuous probabilities which we have postponed to the second
volume. Here we shall be content to show that the simplest physical
assumptions lead to p(k; λ) as the probability of finding exactly k points
(events) within a fixed interval of specified length. Our methods are
necessarily crude, and we shall return to the same problem with more 
adequate methods in chapter XVII. 

The physical assumptions which we want to express mathematically
are that the conditions of the experiment remain constant in time, and
that non-overlapping time intervals are stochastically independent in the
sense that information concerning the number of events in one interval
reveals nothing about the other. The theory of probabilities in a
continuum makes it possible to express these statements directly, but being
restricted to discrete probabilities, we have to use an approximate
finite model and pass to the limit.

Imagine the unit time interval divided into a great number n of
intervals, each of length 1/n. Either a particular subinterval is empty,
or it contains at least one of our random points (or events), and we
agree to call the two possibilities failure and success, respectively. The
probability $p_n$ of success must be the same for all n subintervals, since
they have the same length. The assumed independence of non-overlapping
intervals then implies that we have n Bernoulli trials, and the
probability of exactly k successes is given by $b(k; n, p_n)$. Now the
number of successes is not necessarily the same as the number of random
points, since a subinterval may contain several random points.
However, it is natural to introduce the additional assumption that the
probability of two or more random points during a very short time
interval is in the limit negligible.¹⁰ In this case the probability of
finding exactly k random points in the unit time interval is given by the
limit of $b(k; n, p_n)$ as n → ∞. When we divide each subinterval into
two parts of equal length, we find that $p_n = 2p_{2n} - p_{2n}^2$; this
equation states that success in an interval of length 1/n means either success
in the left half, or success in the right half, or in both. It follows that
$p_n < 2p_{2n}$, and this suggests that $np_n$ increases monotonically (which
can be proved rigorously). If $np_n \to \lambda$, then
$b(k; n, p_n) \approx b(k; n, \lambda/n) \to p(k; \lambda)$, and we find (6.1) as the
probability that there is a total of k random points contained in our unit
interval. The assumption $np_n \to \infty$ leads to no sensible result, as it
would imply infinitely many random points even in the smallest interval.
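The halving relation and the monotonicity of $np_n$ can be illustrated on the limiting model itself. In the Poisson model the probability of at least one point in an interval of length 1/n is $1 - e^{-\lambda/n}$; taking this as $p_n$ is an assumption of the sketch below, not a step in the argument above, and the density λ = 2 is an arbitrary choice.

```python
from math import expm1

def p_success(n: int, lam: float) -> float:
    """P(at least one point in a subinterval of length 1/n), Poisson model."""
    return -expm1(-lam / n)  # 1 - exp(-lam/n), computed without cancellation

lam = 2.0  # illustrative density, chosen arbitrarily
for n in (1, 2, 4, 8, 16, 1024):
    p_n, p_2n = p_success(n, lam), p_success(2 * n, lam)
    # Columns: n, n * p_n (rising towards lam), residual of p_n = 2 p_2n - p_2n**2
    print(n, n * p_n, p_n - (2 * p_2n - p_2n**2))
```

The residual column vanishes up to rounding, and the middle column climbs towards λ from below.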

If, instead of the unit interval, we take an arbitrary interval of
length t and again use a subdivision into intervals of length 1/n, then
we have Bernoulli trials with the same probability $p_n$ of success, but
the number of trials is the integer nearest to nt rather than n. The
passage to the limit is the same, but we get λt instead of λ. This leads
us to consider

(6.2)    $$p(k; \lambda t) = e^{-\lambda t}\,\frac{(\lambda t)^k}{k!}$$

as the probability of finding exactly k points in a fixed interval of length t.
In particular, the probability of no point in an interval of length t is

(6.3)    $$p(0; \lambda t) = e^{-\lambda t},$$

and the probability of one or more points is therefore $1 - e^{-\lambda t}$.


10 This assumption is implicit in the intuitive picture of isolated random points.
However, it is necessary to exclude the possibility of our events appearing in
doublets. For example, if the events are automobile accidents, then the probability
of two events within a short time is negligible in comparison with the probability
of one event. On the other hand, an accident is likely to involve two cars, and if
the events mean “a car smashed” then they are likely to appear in pairs and our
assumption does not apply.

The parameter λ is a physical constant which determines the density
of points on the t-axis. The larger λ is, the smaller is the probability
(6.3) of finding no point. Suppose that a physical experiment is
repeated a great number N of times, and that each time we count the
number of events in an interval of fixed length t. Let $N_k$ be the
number of times that exactly k events are observed. Then

(6.4)    $$N_0 + N_1 + N_2 + \ldots = N.$$

The total number of points in the N experiments is

(6.5)    $$N_1 + 2N_2 + 3N_3 + \ldots = T,$$

and T/N is the average. If N is large, we expect that

(6.6)    $$N_k \approx N p(k; \lambda t)$$

(this lies at the root of all applications of probability and will be
justified and made more precise by the law of large numbers in chapter X).
Substituting from (6.6) into (6.5), we find

(6.7)    $$T \approx N\{p(1; \lambda t) + 2p(2; \lambda t) + 3p(3; \lambda t) + \ldots\} = N e^{-\lambda t}\,\lambda t\left\{1 + \frac{\lambda t}{1} + \frac{(\lambda t)^2}{2!} + \ldots\right\} = N\lambda t$$

and hence

(6.8)    $$\lambda t \approx \frac{T}{N}.$$

This relation gives us a means of estimating λ from observations and
of comparing theory with experiments. The examples of the next
section will illustrate this point.


Spatial Distributions 


We have considered the distribution of random events or points
along the t-axis, but the same argument applies to the distribution of
points in plane or space. Instead of intervals of length t we have
domains of area or volume t, and the fundamental assumption is that the
probability of finding k points in any specified domain depends only
on the area or volume of the domain but not on its shape. Otherwise
we have the same assumptions as before: (1) if t is small, the probability
of finding more than one point in a domain of volume t is small as
compared to t; (2) non-overlapping domains are mutually independent.
To find the probability that a domain of volume t contains exactly k
random points, we subdivide it into n subdomains and approximate
the required probability by the probability of k successes in n trials.
This means neglecting the possibility of finding more than one point
in the same subdomain, but our assumption (1) implies that the error
tends to zero as n → ∞. In the limit we get again the Poisson
distribution (6.2). Stars in space, raisins in cake, weed seeds among grass
seeds, flaws in materials, animal litters in fields are distributed in
accordance with the Poisson law. See examples (7.b) and (7.e).


7. OBSERVATIONS FITTING THE POISSON DISTRIBUTION¹¹


(a) Radioactive disintegrations. A radioactive substance emits
α-particles; the number of particles reaching a given portion of space during
time t is the best-known example of random events obeying the Poisson
law. Of course, the substance continues to decay, and in the long run
the density of α-particles will decline. However, with radium it takes
years before a decrease of matter can be detected; for relatively short
periods the conditions may be considered constant, and we have an
ideal realization of the hypotheses which led to the Poisson distribution.

In a famous experiment¹² a radioactive substance was observed during
N = 2608 time intervals of 7.5 seconds each; the number of particles
reaching a counter was obtained for each period. Table 3 records
the number $N_k$ of periods with exactly k particles. The total number
of particles is $T = \sum k N_k = 10{,}094$, the average T/N = 3.870. The
theoretical values Np(k; 3.870) are seen to be rather close to the
observed numbers $N_k$. To judge the closeness of fit, an estimate of the
probable magnitude of chance fluctuations is required. Statisticians
judge the closeness of fit by the χ²-criterion. Measuring by this
standard, we should expect that under ideal conditions about 17 out of 100
comparable cases would show worse agreement than exhibited in table 3.


11 The Poisson distribution has become known as the law of small numbers or of
rare events. These are misnomers which proved detrimental to the realization of
the fundamental role of the Poisson distribution. The following examples will
show how misleading the two names are.

12 Rutherford, Chadwick, and Ellis, Radiations from radioactive substances,
Cambridge, 1920, p. 172. Table 3 and the χ²-estimate of the text are taken from
H. Cramér, Mathematical methods of statistics, Uppsala and Princeton, 1945, p. 436.


TABLE 3

EXAMPLE (a): RADIOACTIVE DISINTEGRATIONS

    k         N_k      Np(k; 3.870)
    0          57          54.399
    1         203         210.523
    2         383         407.361
    3         525         525.496
    4         532         508.418
    5         408         393.515
    6         273         253.817
    7         139         140.325
    8          45          67.882
    9          27          29.189
  k ≥ 10       16          17.075
  Total      2608        2608.000
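The fitting procedure of example (a) is an application of (6.8): estimate λt by T/N, then compare the observed counts with Np(k; λt). The sketch below reproduces the theoretical column of table 3. It takes T = 10,094 from the text, because the class k ≥ 10 does not reveal the individual counts, and it rounds the estimate to 3.870 as the text does. The variable names are illustrative.

```python
from math import exp, factorial

# Observed numbers N_k from table 3; the last entry is the class k >= 10.
observed = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 16]
N = sum(observed)          # number of time intervals
T = 10_094                 # total number of particles, as stated in the text
lam = round(T / N, 3)      # estimate of lambda*t by (6.8), rounded as in the text

# Theoretical values N p(k; lam) for k = 0..9, then the remainder for k >= 10.
expected = [N * exp(-lam) * lam**k / factorial(k) for k in range(10)]
expected.append(N - sum(expected))

for k, (obs, theo) in enumerate(zip(observed, expected)):
    label = f"{k:>5d}" if k < 10 else ">= 10"
    print(f"{label}  {obs:5d}  {theo:9.3f}")
```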

(b) Flying-bomb hits on London. As an example of a spatial
distribution of random points consider the statistics of flying-bomb hits in
the south of London during World War II. The entire area is divided
into N = 576 small areas of t = 1/4 square kilometers each, and table 4
records the number $N_k$ of areas with exactly k hits.¹³ The total number
of hits is $T = \sum k N_k = 537$, the average λt = T/N = 0.9323.... The
fit of the Poisson distribution is surprisingly good; as judged by the
χ²-criterion, under ideal conditions some 88 per cent of comparable
observations should show a worse agreement. It is interesting to note
that most people believed in a tendency of the points of impact to
cluster. If this were true, there would be a higher frequency of areas
with either many hits or no hit and a deficiency in the intermediate
classes. Table 4 indicates perfect randomness and homogeneity of the
area; we have here an instructive illustration of the established fact
that to the untrained eye randomness appears as regularity or tendency
to cluster.


TABLE 4

EXAMPLE (b): FLYING-BOMB HITS ON LONDON

  k                    0         1        2        3       4     5 and over
  N_k                229       211       93       35       7         1
  Np(k; 0.9323)   226.74    211.39    98.54    30.62    7.14      1.57


13 The figures are taken from R. D. Clarke, An application of the Poisson
distribution, Journal of the Institute of Actuaries, vol. 72 (1946), p. 48.


TABLE 5

EXAMPLE (c): CHROMOSOME INTERCHANGES INDUCED BY X-RAY IRRADIATION

  Experi-                       Cells with k interchanges     Total    χ²-level
  ment                            0        1       2     ≥3      N     in per cent
  number
    1     Observed N_k           753      266      49      5    1073       95
          Np(k; 0.35508)       752.3    267.1    47.4    6.2
    2     Observed N_k           434      195      44      9     682       85
          Np(k; 0.45601)       432.3    197.1    44.9    7.7
    3     Observed N_k           280       75      12      1     368       65
          Np(k; 0.27717)       278.9     77.3    10.7    1.1
    4     Observed N_k          2278      273      15      0    2566       65
          Np(k; 0.11808)      2280.2    269.2    15.9    0.7
    5     Observed N_k           593      143      20      3     759       45
          Np(k; 0.25296)       589.4    149.1    18.8    1.7
    6     Observed N_k           639      141      13      0     793       45
          Np(k; 0.21059)       642.4    135.3    14.2    1.1
    7     Observed N_k           359      109      13      1     482       40
          Np(k; 0.28631)       362.0    103.6    14.9    1.5
    8     Observed N_k           493      176      26      2     697       35
          Np(k; 0.33572)       498.2    167.3    28.1    3.4
    9     Observed N_k           793      339      62      5    1199       20
          Np(k; 0.39867)       804.8    320.8    64.0    9.4
   10     Observed N_k           579      254      47      3     883       20
          Np(k; 0.40544)       588.7    238.7    48.4    7.2
   11     Observed N_k           444      252      59      1     756        5
          Np(k; 0.49339)       461.6    227.7    56.2   10.5

(c) Chromosome interchanges in cells. Irradiation by X-rays
produces certain processes in organic cells which we call chromosome
interchanges. As long as radiation continues, the probability of such
interchanges remains constant, and, according to theory, the numbers $N_k$
of cells with exactly k interchanges should follow a Poisson
distribution. The theory is also able to predict the dependence of the
parameter λ on the intensity of radiation, the temperature, etc., but we shall
not enter into these details. Table 5 records the result of eleven
different series of experiments.¹⁴ These are arranged according to
goodness of fit. The last column indicates the approximate percentage of
ideal cases in which chance fluctuations would produce a worse
agreement (as judged by the χ²-standard). The agreement between theory
and observation is striking.

(d) Connections to wrong number. Table 6 shows statistics of
telephone connections to a wrong number.¹⁵ A total of N = 267 numbers
was observed; $N_k$ indicates how many numbers had exactly k wrong
connections. The Poisson distribution p(k; 8.74) shows again an
excellent fit. (As judged by the χ²-criterion the deviations are near the
median value.) In Thorndike's paper the reader will find other
telephone statistics following the Poisson law. Sometimes (as with party
lines, calls from groups of coin boxes, etc.) there is an obvious
interdependence among the events, and the Poisson distribution no longer
fits.


14 D. G. Catcheside, D. E. Lea, and J. M. Thoday, Types of chromosome
structural change induced by the irradiation of Tradescantia microspores, Journal of
Genetics, vol. 47 (1945-46), pp. 113-136. Our table is table IX of this paper, except
that the χ²-levels were recomputed, using a single degree of freedom.

15 The observations are taken from F. Thorndike, Applications of Poisson's
probability summation, The Bell System Technical Journal, vol. 5 (1926), pp. 604-624.
This paper contains a graphical analysis of 32 different statistics.


TABLE 6

EXAMPLE (d): CONNECTIONS TO WRONG NUMBER

    k       N_k     Np(k; 8.74)
   0-2        1         2.05
    3         5         4.76
    4        11        10.39
    5        14        18.16
    6        22        26.45
    7        43        33.03
    8        31        36.09
    9        40        35.04
   10        30        30.63
   11        20        24.34
   12        18        17.72
   13        12        11.92
   14         7         7.44
   15         6         4.33
  ≥ 16        2         4.65
            267       267.00

FicurE 1. Bacteria on a Petri plate. 


(e) Bacteria and blood counts. Figure 1 reproduces a photograph of
a Petri plate with bacterial colonies, which are visible under the
microscope as dark spots. The plate is divided into small squares.
Table 7 reproduces the observed numbers of squares with exactly k dark
spots in eight experiments with as many different kinds of bacteria.16
We have here a representative of an important practical application of
the Poisson distribution to spatial distributions of random points.


TABLE 7

EXAMPLE (e): COUNTS OF BACTERIA

 k                    0        1
 Observed N_k         5       19
 Poisson theor.       6.1     18.0
 Observed N_k        26       40
 Poisson theor.      27.5     42.2
 Observed N_k        59       86
 Poisson theor.      55.6     82.2
 Observed N_k        83      134
 Poisson theor.      75.0    144.5
 Observed N_k         8       16
 Poisson theor.       6.8     16.2
 Observed N_k         7       11
 Poisson theor.       3.9     10.4
 Observed N_k         3        7
 Poisson theor.       2.1      8.2
 Observed N_k        60       80
 Poisson theor.      62.6     75.8

The last entry in each row includes the figures for higher classes and should
be labeled "k or more."


16 The table is taken from J. Neyman, Lectures and conferences on mathematical
statistics (mimeographed), Dept. of Agriculture, Washington, 1938. The original
(by T. Matuszewsky, J. Supinska, and J. Neyman) appeared, together with related
material, in Zentralblatt für Bakteriologie, Parasitenkunde und Infektionskrankheiten,
II Abt., vol. 95 (1936).
8. WAITING TIMES. THE NEGATIVE BINOMIAL
DISTRIBUTION


Consider a succession of n Bernoulli trials and let us inquire how
long it will take for the rth success to turn up. Here r is a fixed
positive integer. The total number of successes in n trials may, of
course, fall short of r, but the probability that the rth success occurs
at the trial number ν ≤ n is clearly independent of n and depends only
on ν, r, and p. Since necessarily ν ≥ r, it is preferable to write
ν = k + r. The probability that the rth success occurs at the trial
number r + k (where k = 0, 1, ...) will be denoted by f(k; r, p). It
equals the probability that exactly k failures precede the rth success.
This event occurs if, and only if, among the first r + k - 1 trials there
are exactly k failures and the following, or (r+k)th, trial results in
success; the corresponding probabilities are
\binom{r+k-1}{k} p^{r-1} q^k and p, so that

(8.1)    f(k; r, p) = \binom{r+k-1}{k} p^r q^k.


Rewriting the binomial coefficient in accordance with formula II(12.4),
we find the alternative form

(8.2)    f(k; r, p) = \binom{-r}{k} p^r (-q)^k,    k = 0, 1, 2, ....


Suppose now that Bernoulli trials are continued as long as necessary
for r successes to turn up. A typical sample point is represented by a
sequence containing an arbitrary number, k, of letters F and exactly
r letters S, the sequence terminating by an S; the probability of such
a point is, by definition, p^r q^k. We must ask, however, whether it is
possible that the trials never end, that is, whether an infinite sequence
of trials may produce fewer than r successes. Now
\sum_{k=0}^{\infty} f(k; r, p) is the probability that the rth success
occurs after finitely many trials; accordingly, the possibility of an
infinite sequence with fewer than r successes can be discounted if, and
only if,

(8.3)    \sum_{k=0}^{\infty} f(k; r, p) = 1.


To prove that (8.3) holds it suffices to note that by the binomial
theorem

(8.4)    \sum_{k=0}^{\infty} \binom{-r}{k} (-q)^k = (1 - q)^{-r} = p^{-r}.

Multiplying (8.4) by p^r we get (8.3).

In our waiting time problem r is necessarily a positive integer, but
the quantity defined by either (8.1) or (8.2) is non-negative and (8.3)
holds for any positive r. For arbitrary fixed real r > 0 and 0 < p < 1
the sequence {f(k; r, p)} is called a negative binomial distribution. It
occurs in many applications (and we have encountered it in problem
V, 24 as the limiting form of the Polya distribution). When r is a
positive integer, {f(k; r, p)} may be interpreted as the probability
distribution for the waiting time to the rth success; as such it is also
called the Pascal distribution. For r = 1 it reduces to the geometric
distribution {pq^k}.
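
Formulas (8.1) and (8.3) are easy to check numerically. The sketch below is restricted to positive integer r (the waiting time case) and truncates the infinite sum at an arbitrary cutoff; the function name and the chosen values r = 3, p = 0.4 are illustrative assumptions only.

```python
import math


def neg_binomial(k, r, p):
    """f(k; r, p) = C(r+k-1, k) p^r q^k of (8.1), for a positive integer r."""
    q = 1.0 - p
    return math.comb(r + k - 1, k) * p ** r * q ** k


if __name__ == "__main__":
    r, p = 3, 0.4
    # Partial sum of (8.3); the neglected tail is negligible for this cutoff.
    total = sum(neg_binomial(k, r, p) for k in range(400))
    print("sum of f(k; 3, 0.4) over k < 400:", total)

    # For r = 1 the distribution reduces to the geometric distribution p q^k.
    for k in range(5):
        print(k, neg_binomial(k, 1, p), p * (1 - p) ** k)
```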


TABLE 8

PROBABILITIES (8.5)

  r     u_r          U_r           r     u_r          U_r
  0   0.079 589   0.079 589       15   0.023 171   0.917 941
  1    .079 589    .159 178       16    .019 081    .937 022
  2    .078 785    .237 963       17    .015 447    .952 469
  3    .077 177    .315 140       18    .012 283    .964 752
  4    .074 790    .389 931       19    .009 587    .974 338
  5    .071 674    .461 605       20    .007 338    .981 676
  6    .067 902    .529 506       21    .005 504    .987 180
  7    .063 568    .593 073       22    .004 041    .991 220
  8    .058 783    .651 855       23    .002 901    .994 121
  9    .053 671    .705 527       24    .002 034    .996 155
 10    .048 363    .753 890       25    .001 392    .997 547
 11    .042 989    .796 879       26    .000 928    .998 475
 12    .037 676    .834 555       27    .000 602    .999 077
 13    .032 538    .867 094       28    .000 379    .999 456
 14    .027 676    .894 770       29    .000 232    .999 688

u_r is the probability that, at the moment the first match box is found empty,
the second contains exactly r matches, assuming that initially each box
contained 50 matches. U_r = u_0 + u_1 + ... + u_r is the corresponding
probability of having not more than r matches.
Example. Banach's match box problem.17 A certain mathematician
always carries one match box in his right pocket and one in his left.
When he wants a match, he selects a pocket at random, the successive
choices thus constituting Bernoulli trials with p = 1/2. Suppose that
initially each box contained exactly N matches and consider the moment
when, for the first time, our mathematician discovers that a box
is empty. At that moment the other box may contain 0, 1, 2, ..., N
matches, and we denote the corresponding probabilities by u_r. Let us
identify "success" with choice of the left pocket. The left pocket will
be found empty at a moment when the right pocket contains exactly
r matches if, and only if, exactly N - r failures precede the (N+1)st
success. The probability of this event is f(N-r; N+1, 1/2). The same
argument applies to the right pocket and therefore the required
probability is

(8.5)    u_r = 2 f(N-r; N+1, 1/2) = \binom{2N-r}{N} 2^{-2N+r}.

Numerical values for the case N = 50 are given in table 8. [Cf.
problems 21-23 and example IX(3.f).]


9. THE MULTINOMIAL DISTRIBUTION 


The binomial distribution can easily be generalized to the case of n
repeated independent trials where each trial can have one of several
outcomes. Denote the possible outcomes of each trial by E_1, ..., E_r,
and suppose that the probability of the realization of E_i in each trial
is p_i (i = 1, ..., r). For r = 2 we have Bernoulli trials; in general,
the numbers p_i are subject only to the condition

(9.1)    p_1 + ... + p_r = 1,    p_i ≥ 0.

The result of n trials is a succession like E_3 E_1 E_2 .... The probability
that in n trials E_1 occurs k_1 times, E_2 occurs k_2 times, etc., is

(9.2)    \frac{n!}{k_1! k_2! \cdots k_r!} p_1^{k_1} p_2^{k_2} \cdots p_r^{k_r};

here the k_i are arbitrary non-negative integers subject to the obvious
condition

(9.3)    k_1 + k_2 + ... + k_r = n.

If r = 2, then (9.2) reduces to the binomial distribution with p_1 = p,
p_2 = q, k_1 = k, k_2 = n - k. The proof in the general case proceeds
along the same lines, starting with formula II(4.7).

17 Communicated by H. Steinhaus.

Formula (9.2) is called the multinomial distribution because the
right-hand member is the general term of the multinomial expansion of
(p_1 + ... + p_r)^n. Its main application is to sampling with replacement
when the individuals are classified into more than two categories (e.g.,
according to professions).


Examples. (a) In rolling twelve dice, what is the probability of
getting each face twice? Here E_1, ..., E_6 represent the six faces, all
k_i equal 2, and all p_i equal 1/6. Therefore, the answer is
(12!) 2^{-6} 6^{-12} = 0.0034....

(b) Sampling. Let a population of N elements be divided into
subclasses E_1, ..., E_r of sizes Np_1, ..., Np_r. The multinomial
distribution gives the probabilities of the several possible compositions
of a random sample with replacement of size n taken from this population.

(c) Multiple Bernoulli trials. Two sequences of Bernoulli trials with
probabilities of success and failure p_1, q_1, and p_2, q_2, respectively,
may be considered one compound experiment with four possible outcomes
in each trial, namely, the combinations (S, S), (S, F), (F, S), (F, F).
The assumption that the two original sequences are independent is
translated into the statement that the probabilities of the four
outcomes are p_1 p_2, p_1 q_2, q_1 p_2, q_1 q_2, respectively. If k_1, k_2,
k_3, k_4 are four integers adding to n, the probability that in n trials
SS will appear k_1 times, SF k_2 times, etc., is

(9.4)    \frac{n!}{k_1! k_2! k_3! k_4!} p_1^{k_1+k_2} q_1^{k_3+k_4} p_2^{k_1+k_3} q_2^{k_2+k_4}.

A special case occurs in sampling inspection. An item is conforming or
defective with probabilities p and q. It may or may not be inspected
with corresponding probabilities p' and q'. The decision of whether
an item is inspected is made without knowledge of its quality, so that
we have independent trials. (Cf. problems 25, 26, and IX, 12.)
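
The multinomial probability (9.2) can be evaluated directly. The sketch below is an illustration only: the function name is hypothetical, and the two checks are the dice problem of example (a) and the reduction to the binomial distribution for r = 2.

```python
from math import comb, factorial, prod


def multinomial(ks, ps):
    """Probability (9.2) that outcome E_i occurs ks[i] times in n = sum(ks) trials."""
    coefficient = factorial(sum(ks))
    for k in ks:
        coefficient //= factorial(k)   # each partial quotient is an integer
    return coefficient * prod(p ** k for p, k in zip(ps, ks))


if __name__ == "__main__":
    # Example (a): twelve dice, each face appearing exactly twice.
    print(multinomial([2] * 6, [1 / 6] * 6))

    # For r = 2 the formula reduces to the binomial term b(k; n, p).
    n, k, p = 10, 3, 0.2
    print(multinomial([k, n - k], [p, 1 - p]),
          comb(n, k) * p ** k * (1 - p) ** (n - k))
```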


10. PROBLEMS FOR SOLUTION 


1. Assuming all sex distributions to be equally probable, what proportion 
of families with exactly six children should be expected to have three boys and 
three girls? 

2. A bridge player had no ace in three consecutive hands. Did he have 
reason to complain of ill luck? 

3. How long has a series of random digits to be in order for the probability 
of the digit 7 appearing to be at least 7%? 
4. How many independent bridge dealings are required in order for the
probability of a preassigned player having four aces at least once to be 1/2 or
better? Solve again for some player instead of a given one.


5. If the probability of hitting a target is 1/5 and ten shots are fired
independently, what is the probability of the target's being hit at least twice?

6. In problem 5, find the conditional probability of the target’s being hit at 
least twice, assuming that at least one hit is scored. 

7. Find the probability that a hand of thirteen bridge cards selected at
random contains exactly two red cards. Compare it with the corresponding
probability in Bernoulli trials with p = 1/2. (For a definition of bridge see
footnote 1, chapter I.)


8. What is the probability that the birthdays of six people fall in two calen- 
dar months leaving exactly ten months free? (Assume independence and 
equal probabilities for all months.) 


9. In rolling six true dice, find the probability of obtaining (a) at least one,
(b) exactly one, (c) exactly two, aces. Compare with the Poisson
approximations.


10. If there are on the average 1 per cent left-handers, estimate the chances 
of having at least four left-handers among 200 people. 


11. A book of 500 pages contains 500 misprints. Estimate the chances that 
a given page contains at least three misprints. 


12. Colorblindness appears in 1 per cent of the people in a certain popula- 
tion. How large must a random sample (with replacements) be if the proba- 
bility of its containing a colorblind person is to be 0.95 or more? 


13. In the preceding exercise, what is the probability that a sample of 100
will contain (a) no, (b) two or more, colorblind people?


14. Estimate the number of raisins which a cookie should contain on the 
average if it is desired that the probability of a cookie to contain at least one 
raisin be 0.99 or more. 


15. The probability of a royal flush in poker is p = 1/649,740. How large 
has n to be to render the probability of no royal flush in n hands smaller than 
1/e ~ 3? (Note: No calculations are necessary for the solution.) 


16. A book of n pages contains on the average λ misprints per page.
Estimate the probability that at least one page will contain more than k
misprints.


17. Suppose that there exist two kinds of stars (or raisins in a cake, or flaws
in a material). The probability that a given volume contains j stars of the
first kind is p(j; a), and the probability that it contains k stars of the second
kind is p(k; b); the two events are assumed to be independent. Prove that
the probability that the volume contains a total of n stars is p(n; a + b).
(Interpret the assertion and the assumptions abstractly.)


18. A traffic problem. The flow of traffic at a certain street crossing is 
described by saying that the probability of a car’s passing during any given 
second is a constant p; and that there is no interaction between the passing 
of cars at different seconds. Treating seconds as indivisible time units, the 
model of Bernoulli trials applies. Suppose that a pedestrian can cross the 
street only if no car is to pass during the next three seconds. Find the proba- 
bility that the pedestrian has to wait for exactly k = 0, 1, 2, 3, 4 seconds. 
(The corresponding general formulas are not obvious and will be derived in
connection with the theory of success runs in chapter XIII, section 7.)

19. Two people toss a true coin n times each. Find the probability that
they will score the same number of heads.

20. In a sequence of Bernoulli trials with probability p for success, find the
probability that a successes will occur before b failures. (Note: The issue is
decided after at most a + b - 1 trials. This problem played a role in the
classical theory of games in connection with the question of how to divide
the pot when the game is interrupted at a moment when one player lacks a
points to victory, the other b points.)

21. In Banach’s match box problem (section 8) find the probability that at 
the moment when the first box is emptied (not found empty) the other contains 
exactly r matches (where r = 1, 2, ..., N). 

22. Continuation. Using the preceding result, find the probability x that
the box first emptied is not the one first found to be empty. Show that the
expression thus obtained reduces to x = \binom{2N}{N} 2^{-2N-1} or
(1/2)(N\pi)^{-1/2}, approximately.

23. Proofs of a certain book were read independently by two proofreaders
who found, respectively, k_1 and k_2 misprints; k_{12} misprints were found by
both. Give a reasonable estimate of the unknown number, n, of misprints in the
proofs. (Assume that proofreading corresponds to Bernoulli trials in which
the two proofreaders have, respectively, probabilities p_1 and p_2 of catching a
misprint. Use the law of large numbers.)

Note: The problem describes in simple terms an experimental setup used 
by Rutherford for the count of scintillations. 

24. To estimate the size of an animal population by trapping,18 traps are
set r times in succession. Assuming that each animal has the same probability
q of being trapped; that originally there were n animals in all; and that the
only changes in the situation between the successive settings of traps are that
animals have been trapped (and thus removed); find the probability that
the r trappings yield, respectively, n_1, n_2, ..., n_r animals.

25. Multiple Bernoulli trials. In example (9.c) find the conditional
probabilities p and q of (S, F) and (F, S), respectively, assuming that one of
these combinations has occurred. Show that p > 1/2 or p < 1/2, according as
p_1 > p_2 or p_2 > p_1.

26. Continuation.19 If in n pairs of trials exactly m resulted in one of the
combinations (S, F) or (F, S), show that the probability that (S, F) has occurred
exactly k times is b(k; m, p).

27. Combination of the binomial and Poisson distributions. Suppose that the
probability of an insect's laying r eggs is p(r; λ) and that the probability of an
egg's developing is p. Assuming mutual independence of the eggs, show that
the probability of a total of k survivors is given by the Poisson distribution
with parameter λp.

Note: Another example for the same situation: the probability of k
chromosome breakages is p(k; λ), and the probability of a breakage healing is p.
(For additional examples of a similar nature see IX(1.d) and chapter XII,
section 1.)


18 P. A. P. Moran, A mathematical theory of animal trapping, Biometrika, vol.
38 (1951), pp. 307-311.

19 A. Wald, Sequential tests of statistical hypotheses, Annals of Mathematical
Statistics, vol. 16 (1945), p. 166. Wald uses the results given above to devise a
practical method of comparing two empirically given sequences of trials (say, the
output of two machines), with a view of selecting the one with the greater
probability of success. He reduces this problem to the simpler one of finding whether
in a sequence of Bernoulli trials the frequency of success differs significantly from 1/2.


28. Prove the theorem:20 The maximal term of the multinomial distribution
(9.2) satisfies the inequalities

(10.1)    np_i - 1 < k_i ≤ (n + r - 1)p_i,    i = 1, 2, ..., r.

(Hint: Prove first that the term is maximal if, and only if, p_i k_j ≤ p_j(k_i + 1)
for each pair (i, j). Add these inequalities for all j, and also for all i ≠ j.)


29. The terms p(k; λ) of the Poisson distribution reach their maximum
when k is the largest integer not exceeding λ.


Note: Problems 30-34 refer to the Poisson approximation of the binomial
distribution. It is understood that λ = np, and m is the largest integer not
exceeding (n + 1)p (that is, m is the index of the central term of the binomial
distribution).


30. Show that as k goes from 0 to ∞ the ratios a_k = b(k; n, p)/p(k; λ) first
increase, then decrease, reaching their maximum for k = m.

31. As k increases, the terms b(k; n, p) are first smaller, then larger, and
then again smaller than p(k; λ).

32. If n → ∞ and p → 0 so that np = λ remains constant, then

b(k; n, p) → p(k; λ)

uniformly for all k.
33. Show that

(10.2)    \frac{\lambda^k}{k!}\left(1 - \frac{\lambda}{n}\right)^{n-k} ≥ b(k; n, p) ≥ \frac{\lambda^k}{k!}\left(1 - \frac{k}{n}\right)^{k}\left(1 - \frac{\lambda}{n}\right)^{n-k}.


34. Conclude from (10.2), using the inequalities II(8.12), that

(10.3)    p(k; \lambda) e^{k\lambda/n} > b(k; n, p) > p(k; \lambda) e^{-k^2/(n-k) - \lambda^2/(n-\lambda)}.


Note: Although (10.2) is very crude, the inequalities (10.3) provide excellent
error estimates. It is easy to improve on (10.3) by calculations similar to
those used in chapter II, section 9. Incidentally, using the result of problem
30, it is obvious that the exponent on the left in (10.3) may be replaced by
mλ/n, which is < (p + n^{-1})λ.


Further Limit Theorems 
35. Binomial approximation to the hypergeometric distribution. A population
of N elements is divided into red and black elements in the proportion p:q
(where p + q = 1). A sample of size n is taken without replacement. The
probability that it contains exactly k red elements is given by the
hypergeometric distribution of chapter II, section 6. Show that as N → ∞ this
probability approaches b(k; n, p).


20 In the first edition it was only asserted that |k_i - np_i| < r. The present
improvement and its elegant proof are due to P. A. P. Moran.

36. In the preceding problem let p be small, n large, and λ = np of moderate
magnitude. The hypergeometric distribution can then be approximated by
the Poisson distribution p(k; λ). Verify this directly without using the binomial
approximation.


37. In the negative binomial distribution {f(k; r, p)} of section 8 let q → 0
and r → ∞ in such a way that rq = λ remains fixed. Show that

f(k; r, p) → p(k; λ).

(Note: This provides a limit theorem for the Polya distribution; cf. problem
V, 24.)

38. Multiple Poisson distribution. When n is large and np_i = λ_i is moderate,
the multinomial distribution (9.2) can be approximated by

e^{-(\lambda_1 + ... + \lambda_r)} \frac{\lambda_1^{k_1} \lambda_2^{k_2} \cdots \lambda_r^{k_r}}{k_1! k_2! \cdots k_r!}.

Prove also that the terms of this distribution add to unity. (Note that
problem 17 refers to a double Poisson distribution.)

39. (a) Derive (3.6) directly from (3.5) using the obvious relation
b(k; n, p) = b(n-k; n, q).

(b) Deduce the binomial distribution both by induction and from the general
summation formula IV(3.1).

40. Prove \sum k b(k; n, p) = np, and \sum k^2 b(k; n, p) = n^2 p^2 + npq.

41. Prove \sum k^2 p(k; \lambda) = \lambda^2 + \lambda.

42. Verify the identity

(10.4)    \sum_{\nu=0}^{k} b(\nu; n_1, p) b(k - \nu; n_2, p) = b(k; n_1 + n_2, p)

and interpret it probabilistically. Hint: Use II(6.4).

Note: Equation (10.4) is a special case of convolutions, to be introduced in
chapter XI; (10.5) is another example.

43. Verify the identity

(10.5)    \sum_{\nu=0}^{k} p(\nu; \lambda_1) p(k - \nu; \lambda_2) = p(k; \lambda_1 + \lambda_2).

44. Let

(10.6)    B(k; n, p) = \sum_{\nu=0}^{k} b(\nu; n, p)

be the probability of at most k successes in n trials. Then

(10.7)    B(k; n+1, p) = B(k; n, p) - p b(k; n, p),
          B(k+1; n+1, p) = B(k; n, p) + q b(k+1; n, p).

Verify this (a) from the definition, (b) analytically.
45. With the same notation

(10.8)    B(k; n, p) = (n - k) \binom{n}{k} \int_0^q t^{n-k-1} (1 - t)^k dt

and

(10.9)    1 - B(k; n, p) = n \binom{n-1}{k} \int_0^p t^k (1 - t)^{n-k-1} dt.

Hint: Integrate by parts or differentiate both sides with respect to p.
Deduce one formula from the other.

Note: The integral in (10.9) is the incomplete beta function. Tables of
1 - B(k; n, p) to 7 decimals for k and n up to 50 and p = 0.01, 0.02, 0.03, ...
are given in K. Pearson, Tables of the incomplete beta function, London
(Biometrika Office), 1934.


46. Prove

(10.10)    p(0; \lambda) + ... + p(n; \lambda) = \frac{1}{n!} \int_{\lambda}^{\infty} e^{-x} x^n dx.

Note: In the following problems we give an upper bound for all the terms of the
binomial distribution. The calculations are quite simple and the method can be
improved to give the simplest derivation of the DeMoivre-Laplace limit theorem (cf.
problems VII, 19-21). Put for abbreviation

(10.11)    x_k = \frac{k - (n+1)p + 1/2}{\{(n+1)pq\}^{1/2}}

and let m be the index of the central term; that is, m is the integer satisfying (3.2).


47. Prove that for r > (n + 1)p 
(10.12) b(r; n, p) < b(m; n, p)-e~ tre + 


where δ = m - (n+1)p + 1/2, whence |δ| ≤ 1/2.

Hint: Rewrite (3.1) in the form

(10.13)    \frac{b(k; n, p)}{b(k-1; n, p)} = \frac{(n+1)pq - \{k - (n+1)p\}p}{(n+1)pq + \{k - (n+1)p\}q}.

Conclude that for k > (n + 1)p

(10.14)    \log \frac{b(k; n, p)}{b(k-1; n, p)} < \log\left\{1 - \frac{k - (n+1)p}{(n+1)q}\right\} < -\frac{k - (n+1)p}{(n+1)q},

whence the assertion follows by summation.

48. For r < (n + 1)p the inequality (10.12) holds with the factor p in the
exponent replaced by q. Hence, if p is replaced by pq, the inequality holds
for all r.


CHAPTER VII 


The Normal Approximation 


to the Binomial Distribution 


1. THE NORMAL DISTRIBUTION 


In order to avoid later interruptions we pause here to introduce two 
functions of great importance. 


Definition. The function defined by

(1.1)    \phi(x) = \frac{1}{(2\pi)^{1/2}} e^{-x^2/2}

is called the normal density function; its integral

(1.2)    \Phi(x) = \frac{1}{(2\pi)^{1/2}} \int_{-\infty}^{x} e^{-y^2/2} dy

is the normal distribution function.


The graph of φ(x) is the symmetric, bell-shaped curve shown in
figure 1. Note that different units are used along the two axes: The
maximum of φ(x) is (2π)^{-1/2} = 0.399, approximately, so that in an
ordinary Cartesian system the curve y = φ(x) would be much flatter.
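
Both functions are readily evaluated by machine. The following is a small sketch, assuming the standard relation Φ(x) = (1/2){1 + erf(x/2^{1/2})} between the normal distribution function and the error function of the Python `math` module; the names `phi` and `Phi` are chosen here for illustration.

```python
import math


def phi(x):
    """Normal density function (1.1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Normal distribution function (1.2), expressed through math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0, 2.0, 3.0):
        print(f"{x:3.1f}  {phi(x):.6f}  {Phi(x):.6f}")
```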


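Lemma 1. The domain bounded by the graph of φ(x) and the x-axis
has unit area, that is,

(1.3)    \int_{-\infty}^{+\infty} \phi(x) dx = 1.

Proof. We have

(1.4)    \left\{\int_{-\infty}^{+\infty} \phi(x) dx\right\}^2 = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \phi(x) \phi(y) dx dy
             = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} e^{-(x^2+y^2)/2} dx dy.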
Figure 1. The normal density function φ(x). (Areas marked on the figure:
50%, 68.3%, 95.6%, and 99.7%.)

Figure 2. The normal distribution function.
This double integral can be expressed in polar coordinates thus:

(1.5)    \frac{1}{2\pi} \int_0^{2\pi} d\theta \int_0^{\infty} e^{-r^2/2} r dr = \int_0^{\infty} e^{-r^2/2} r dr = -e^{-r^2/2} \Big|_0^{\infty} = 1,

which proves the assertion.

It follows from the definition and the lemma that Φ(x) increases
steadily from 0 to 1. Its graph (figure 2) is an S-shaped curve with

(1.6)    \Phi(-x) = 1 - \Phi(x).

Table 1 gives the values1 of Φ(x) for positive x, and from (1.6) we get
Φ(-x).

For many purposes it is convenient to have an elementary estimate
of the "tail," 1 - Φ(x), for large x. Such an estimate is given by


Lemma 2.2 As x → ∞

(1.7)    1 - \Phi(x) \sim \frac{1}{(2\pi)^{1/2} x} e^{-x^2/2};

more precisely, for every x > 0 the double inequality

(1.8)    \frac{1}{(2\pi)^{1/2}} e^{-x^2/2} \left\{\frac{1}{x} - \frac{1}{x^3}\right\} < 1 - \Phi(x) < \frac{1}{(2\pi)^{1/2}} e^{-x^2/2} \frac{1}{x}

holds (cf. problem 1).

Proof. By differentiation we may verify that

(1.9)    \frac{1}{(2\pi)^{1/2}} e^{-x^2/2} \frac{1}{x} = \frac{1}{(2\pi)^{1/2}} \int_x^{\infty} e^{-y^2/2} \left\{1 + \frac{1}{y^2}\right\} dy.

The integrand on the right side is greater than the integrand of

(1.10)    1 - \Phi(x) = \frac{1}{(2\pi)^{1/2}} \int_x^{\infty} e^{-y^2/2} dy,


1 For larger tables cf. Tables of probability functions, vol. 2, National Bureau of
Standards, New York, 1942. There φ(x) and Φ(x) - Φ(-x) are given to 15
decimals for x from 0 to 1 in steps of 0.0001 and for x > 1 in steps of 0.001.

2 Here and in the sequel the sign ~ is used to indicate that the ratio of the two
sides tends to one.
TABLE 1

   t       φ(t)         Φ(t)
  0.0   0.398 942    0.500 000
  0.1    .396 952     .539 828
  0.2    .391 043     .579 260
  0.3    .381 388     .617 911
  0.4    .368 270     .655 422

  0.5    .352 065     .691 462
  0.6    .333 225     .725 747
  0.7    .312 254     .758 036
  0.8    .289 692     .788 145
  0.9    .266 085     .815 940

  1.0    .241 971     .841 345
  1.1    .217 852     .864 334
  1.2    .194 186     .884 930
  1.3    .171 369     .903 200
  1.4    .149 727     .919 243

  1.5    .129 518     .933 193
  1.6    .110 921     .945 201
  1.7    .094 049     .955 435
  1.8    .078 950     .964 070
  1.9    .065 616     .971 283

  2.0    .053 991     .977 250
  2.1    .043 984     .982 136
  2.2    .035 475     .986 097
  2.3    .028 327     .989 276
  2.4    .022 395     .991 802

  2.5    .017 528     .993 790
  2.6    .013 583     .995 339
  2.7    .010 421     .996 533
  2.8    .007 915     .997 445
  2.9    .005 953     .998 134

  3.0    .004 432     .998 650
  3.1    .003 267     .999 032
  3.2    .002 384     .999 313
  3.3    .001 723     .999 517
  3.4    .001 232     .999 663

  3.5    .000 873     .999 767
  3.6    .000 612     .999 841
  3.7    .000 425     .999 892
  3.8    .000 292     .999 928
  3.9    .000 199     .999 952

  4.0    .000 134     .999 968
  4.1    .000 089     .999 979
  4.2    .000 059     .999 987
  4.3    .000 039     .999 991
  4.4    .000 025     .999 995

  4.5    .000 016     .999 997


which proves the second inequality in (1.8). The first inequality
follows in the same way, using as new integrand e^{-y^2/2} {1 - 3/y^4}, which
is smaller than e^{-y^2/2}.
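
The double inequality (1.8) can be illustrated numerically. The sketch below is a spot check at a few points, not a proof; it assumes the complementary error function `math.erfc` for the tail 1 - Φ(x), which avoids the loss of accuracy in forming 1 - Φ(x) by subtraction when x is large. The function names are illustrative.

```python
import math


def phi(x):
    """Normal density function (1.1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def tail(x):
    """1 - Phi(x), computed through the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def tail_bounds(x):
    """Lower bound, tail, and upper bound of (1.8) for x > 0."""
    return phi(x) * (1.0 / x - 1.0 / x ** 3), tail(x), phi(x) / x


if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0, 3.0, 5.0):
        lower, exact, upper = tail_bounds(x)
        print(f"{x:3.1f}  {lower:.3e}  {exact:.3e}  {upper:.3e}")
```

For small x the lower bound is negative and therefore trivial; the two bounds close in on the tail as x grows, in agreement with (1.7).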


Note on Terminology. The term distribution function is used in the
mathematical literature for never-decreasing functions of x which tend to 0 as
x → -∞, and to 1 as x → ∞. Statisticians currently prefer the term cumulative
distribution function, but the adjective "cumulative" is redundant. A density
function is a non-negative function f(x) whose integral, extended over the entire
x-axis, is unity. The integral from -∞ to x of any density function is a
distribution function. The older term frequency function is a synonym for density
function.

The normal distribution function is often called the Gaussian distribution, but
it was used in probability theory earlier by DeMoivre and Laplace. If the origin
and the unit of measurement are changed, then Φ(x) is transformed into Φ((x - a)/b);
this function is called the normal distribution function with mean a and variance
b^2 (or standard deviation |b|). The function 2Φ(2^{1/2} x) - 1 is often called error
function.


2. THE DeMOIVRE-LAPLACE LIMIT THEOREM 


Let S_n stand for the number of successes in n Bernoulli trials with
probability p for success. Then b(k; n, p) is the probability of the
event that S_n = k. In practice we are usually interested in the
probability of the event that the number of successes lies between
preassigned limits α and β. If α and β are integers and α < β, then this
event is defined by the inequality α ≤ S_n ≤ β, and its probability is

(2.1)    P\{\alpha ≤ S_n ≤ \beta\} = b(\alpha; n, p) + b(\alpha+1; n, p) + ... + b(\beta; n, p).


This sum may involve many terms, and a direct evaluation is usually
impractical. Fortunately, whenever n is large, the normal distribution
function can be used to derive simple approximations to the probability
(2.1). This discovery is due to DeMoivre3 and Laplace.4 We shall
see that its importance goes far beyond the domain of numerical
calculations.

Our first aim is to derive an asymptotic formula for the individual
terms

(2.2)    b(k; n, p) = \frac{n!}{k!(n-k)!} p^k q^{n-k}.

The probability p will be kept fixed, but we shall let n → ∞. According
to the law of large numbers [VI(4.2)], the probability that
|S_n - np| > nε tends to zero for each ε > 0, and therefore only values
of k such that |k - np| n^{-1} → 0 present a problem. It is now
convenient to introduce the new variable δ_k = k - np. Then

(2.3)    k = np + \delta_k,    n - k = nq - \delta_k,

and we are interested only in combinations n, k such that n → ∞ and
δ_k/n → 0.


3 Abraham DeMoivre (1667-1754). His The doctrine of chance appeared in 1718.
4 Pierre S. Laplace (1749-1827). His Théorie analytique des probabilités appeared
in 1812.

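Expressing the factorials in (2.2) by means of Stirling's formula
II(9.1), we get 5

(2.4)    b(k; n, p) \sim \left\{\frac{n}{2\pi k(n-k)}\right\}^{1/2} \left(\frac{np}{k}\right)^{k} \left(\frac{nq}{n-k}\right)^{n-k}
             = \left\{\frac{n}{2\pi(np+\delta_k)(nq-\delta_k)}\right\}^{1/2} \frac{1}{(1 + \delta_k/np)^{np+\delta_k} (1 - \delta_k/nq)^{nq-\delta_k}},

where the sign ~ indicates that the ratio of the two sides tends to unity.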

To evaluate the last fraction we pass to logarithms. In the interval
|δ_k| < npq we may use Taylor's expansion II(8.9) and find for the
logarithm of the denominator

(2.5)    (np + \delta_k) \log(1 + \delta_k/np) + (nq - \delta_k) \log(1 - \delta_k/nq) =

         = (np + \delta_k) \left(\frac{\delta_k}{np} - \frac{\delta_k^2}{2n^2p^2} + \frac{\delta_k^3}{3n^3p^3} - ...\right)
           - (nq - \delta_k) \left(\frac{\delta_k}{nq} + \frac{\delta_k^2}{2n^2q^2} + \frac{\delta_k^3}{3n^3q^3} + ...\right).

Reordering the terms according to powers of δ_k, we get

(2.6)    \frac{\delta_k^2}{2n}\left(\frac{1}{p} + \frac{1}{q}\right) - \frac{\delta_k^3}{6n^2}\left(\frac{1}{p^2} - \frac{1}{q^2}\right) + ... = \frac{\delta_k^2}{2npq} - \frac{(q - p)\delta_k^3}{6n^2p^2q^2} + ....

Here the term δ_k^2/2npq is dominant, since δ_k/n → 0. If we suppose
that δ_k^3/n^2 → 0, then all terms in (2.6) except the first tend to
zero, and (2.4) takes on the simpler form

(2.7)    b(k; n, p) \sim \left\{\frac{n}{2\pi(np+\delta_k)(nq-\delta_k)}\right\}^{1/2} e^{-\delta_k^2/2npq}.

However, np + δ_k ~ np and nq - δ_k ~ nq, and so (2.7) may be further
simplified to

(2.8)    b(k; n, p) \sim \frac{1}{\{2\pi npq\}^{1/2}} e^{-\delta_k^2/2npq} = \frac{1}{(npq)^{1/2}} \phi\left(\frac{\delta_k}{(npq)^{1/2}}\right).

This is the desired asymptotic formula. We simplify it by the use
of a more convenient notation. Put

(2.9)    h = \frac{1}{(npq)^{1/2}}

and define a function x_k of the variable k by

(2.10)    x_k = (k - np)h = \frac{\delta_k}{(npq)^{1/2}}.

In terms of these quantities we can rewrite (2.8) in the form

(2.11)    b(k; n, p) \sim h\phi(x_k).


5 It will be recalled that in chapter II we did not complete the proof of Stirling's
formula but showed only that r! ~ C r^{r+1/2} e^{-r}, where C is a positive constant. In
the text it is assumed that C = (2π)^{1/2}. If we want to prove this fact, then the
factor (2π)^{1/2} in equations (2.4), (2.7), and (2.8) must be replaced by C. In this case
a factor C^{-1}(2π)^{1/2} must be inserted on the right sides in (2.11), (2.14), and (2.18).
To show that this factor really equals 1 it suffices to choose x_β and -x_α very large.
The right side in the modified equation (2.18) is then arbitrarily near to C^{-1}(2π)^{1/2},
and the left side is near 1 by the estimates of chapter VI, section 3.


To derive this formula we had to suppose that n → ∞ and k → ∞
in such a way that δ_k n^{-1} → 0 and also δ_k^3 n^{-2} → 0. The last
condition obviously implies the first and is the same as
x_k^3 n^{-1/2} → 0. We have thus


Theorem 1. If n → ∞ and k → ∞ in such a way that x_k^3 n^{-1/2} → 0,
then (2.11) holds. More precisely, we have shown that there exist two
constants A and B such that

(2.12)    \left| \frac{b(k; n, p)}{h\phi(x_k)} - 1 \right| < \frac{A}{n} + \frac{B|x_k|^3}{n^{1/2}}.

(For an alternative for (2.11) see problems 19 and 21.)


Figure 3 illustrates the theorem in the case n = 10, p = 0.2 where
npq is only 1.6. It is seen that even in this extremely unfavorable case
the approximation is surprisingly good.6


6 The values of b(k; 10, 0.2) for k = 0, 1, ..., 6 are 0.1074, 0.2684, 0.3020, 0.2013,
0.0880, 0.0264, 0.0055. The corresponding approximations hφ(x_k) are 0.0904,
0.2307, 0.3154, 0.2307, 0.0904, 0.0189, 0.0021.
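
The comparison in footnote 6 can be reproduced from (2.2) and (2.9) to (2.11). A minimal sketch, with illustrative function names:

```python
import math


def binomial_term(k, n, p):
    """b(k; n, p) of (2.2)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def normal_term(k, n, p):
    """The approximation h * phi(x_k) of (2.11), with h and x_k from (2.9), (2.10)."""
    h = 1.0 / math.sqrt(n * p * (1.0 - p))
    x = (k - n * p) * h
    return h * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


if __name__ == "__main__":
    n, p = 10, 0.2
    for k in range(7):
        print(f"{k}  {binomial_term(k, n, p):.4f}  {normal_term(k, n, p):.4f}")
```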
Figure 3. The normal approximation to the binomial distribution. The step
function gives the probabilities b(k; 10, 1/5) of k successes in ten Bernoulli trials
with p = 1/5. The continuous curve gives for each integer k the corresponding
normal approximation.


Our theorem leads directly to simple approximations for the sum
(2.1). If

(2.13)    h x_\alpha^3 → 0    and    h x_\beta^3 → 0,

then (2.11) holds uniformly for all terms in (2.1), and therefore

(2.14)    P\{\alpha ≤ S_n ≤ \beta\} \sim h\{\phi(x_\alpha) + \phi(x_{\alpha+1}) + ... + \phi(x_\beta)\}.

The right side is a Riemann sum approximating an integral,7 and we
proceed to investigate the goodness of this approximation.

By the mean value theorem there exists a value ξ_k such that

(2.15)    \Phi(x_{k+1/2}) - \Phi(x_{k-1/2}) = h\phi(\xi_k),    x_k - h/2 < \xi_k < x_k + h/2.


7 It is clear that Φ(x_{k+1/2}) - Φ(x_{k-1/2}) represents the area of the trapezoid with
basis x_k - h/2 < x < x_k + h/2 and bounded above by the tangent to the curve
y = φ(x) at x = x_k, and hφ(x_k) represents the area of a rectangle with the same basis.
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Then 
(2.16) h(a) = οὐ &—**)  Φίας 13) — b(ap_y)} - 


Choose an arbitrary $\epsilon > 0$. If (2.13) holds, then for all
$\alpha \le k \le \beta$ and $n$ sufficiently large

$\tfrac12 |\xi_k^2 - x_k^2| = \tfrac12 |\xi_k - x_k| \cdot |\xi_k + x_k| < h\{|x_k| + \tfrac14 h\} < \epsilon$

and hence

(2.17)    $e^{-\epsilon}\{\Phi(x_{k+1/2}) - \Phi(x_{k-1/2})\} < h\phi(x_k) < e^{\epsilon}\{\Phi(x_{k+1/2}) - \Phi(x_{k-1/2})\}.$


Adding over $k$, we see that the ratio of the right-hand member in
(2.14) to $\Phi(x_{\beta+1/2}) - \Phi(x_{\alpha-1/2})$ tends to one. We have thus
proved the


DeMoivre-Laplace Limit Theorem. If $\alpha$ and $\beta$ vary so that
$h x_\alpha^3 \to 0$ and $h x_\beta^3 \to 0$, then


(2.18)    $P\{\alpha \le S_n \le \beta\} \sim \Phi(x_{\beta+1/2}) - \Phi(x_{\alpha-1/2}),$


where $h = (npq)^{-1/2}$ and $x_t = (t - np)h$. In words, the percentage
difference between the two sides in (2.18) tends to zero together with
$h x_\beta^3$ and $h x_\alpha^3$.


In particular, (2.18) holds if $\alpha$ and $\beta$ are restricted to values
for which $x_\alpha$ and $x_\beta$ remain within a fixed interval. (The case
where $\alpha$ and $\beta$ are so large that the condition (2.13) is not
satisfied will be discussed in section 5 and in problem 14.)

In statistical applications (2.18) is usually used in a range in which
$|x_\alpha|$ and $|x_\beta|$ do not exceed 3 or 4. In theoretical applications
it is often necessary to use (2.18) for intervals $(\alpha, \beta)$ which are
far off the central part of the binomial distribution and for which both
$x_\alpha$ and $x_\beta$ are large. In such cases both sides of (2.18) are
small, and it becomes important to know that their ratio is near unity as
well as that their difference tends to zero.

The limit theorem (2.18) takes on a simpler form if, instead of $S_n$,
we introduce the reduced number of successes defined by

(2.19)    $S_n^* = \dfrac{S_n - np}{(npq)^{1/2}}.$

This amounts to measuring the deviations of $S_n$ from $np$ in units of
$(npq)^{1/2}$. The quantity $np$ is called the mean, and $(npq)^{1/2}$ the
standard deviation of $S_n$; this terminology is suggested by the theory of
random variables (cf. chapter IX). The inequality $\alpha \le S_n \le \beta$
is the same as $x_\alpha \le S_n^* \le x_\beta$, and (2.18) states that for
arbitrary fixed $x_\alpha < x_\beta$


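(2.20)    $P\{x_\alpha \le S_n^* \le x_\beta\} \sim \Phi\left(x_\beta + \dfrac{h}{2}\right) - \Phi\left(x_\alpha - \dfrac{h}{2}\right),$


where $h = (npq)^{-1/2}$. Now $h \to 0$ as $n \to \infty$, and therefore the
right side tends to $\Phi(x_\beta) - \Phi(x_\alpha)$. Thus we have the following


Corollary to the Limit Theorem. For every fixed $a < b$

(2.21)    $P\{a \le S_n^* \le b\} \to \Phi(b) - \Phi(a).$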


This is a weakened version of (2.18) but represents the traditional
form of Laplace's limit theorem. The dropping of $h/2$ in (2.20) introduces
an error which tends to zero as $n \to \infty$ but has a considerable
influence when $npq$ is of moderate magnitude [as is the case in the
three examples 3(a)-(c)].

The main fact revealed by (2.21) is that for large n the probability 
on the left is practically independent of p. This permits us to compare 
fluctuations in different series of Bernoulli trials simply by referring to 
our standard units. 


Theorem (2.21) is historically the first limit theorem of probability. From a
modern point of view it is only an exceedingly special case of the central limit
theorem, to which we shall return in chapter X but whose general derivation must
be postponed to the second volume. Statisticians use (2.21) as an approximation
even where $npq$ is relatively small, and in such cases an estimate of the error
is desired. It turns out that in most cases the error in (2.11) is small as
compared to the error committed by replacing the sum in (2.14) by the integral.
(Fortunately this error can be avoided by the use of the Euler-MacLaurin
summation formula.) Serge Bernstein devoted a series of papers to the
investigation of the error term in the general case and discussed how the
definition of $x_k$ should be modified in order to improve the convergence in
(2.18). His papers are written in Russian and are difficult to obtain. A
simplified derivation with an improvement of his results is, however, available
in English.⁸


Note on Optional Stopping 


It is essential to note that our limit and approximation theorems are
valid only if the number $n$ of trials is fixed in advance independently of
the outcome of the trials. If a gambler has the privilege of stopping at
a moment favorable to him, his ultimate gain cannot be judged from
the normal approximation, for now the duration of the game depends
on chance. For every fixed $n$ it is very improbable that $S_n^*$ is large.
However, in the long run, even the most improbable thing is bound to
happen, and we shall see that in a continued game $S_n^*$ is practically
certain to have a sequence of maxima of the order of magnitude
$(\log \log n)^{1/2}$ (this is the law of the iterated logarithm of chapter
VIII, section 5).


8 W. Feller, On the normal approximation to the binomial distribution, Annals
of Mathematical Statistics, vol. 16 (1945), pp. 319-329.

3. EXAMPLES 


(a) Let $p = \tfrac12$, $n = 200$, $\alpha = 95$, $\beta = 105$. Here
$P\{95 \le S_n \le 105\}$ may be interpreted as the probability that in 200
tossings of a coin the number of heads deviates from 100 by at most 5. We
have $h = (50)^{-1/2} = 0.141421\ldots$ and
$-x_{\alpha-1/2} = x_{\beta+1/2} = (5.5)h = 0.7778\ldots$. From tables we get
$\Phi(x_{\beta+1/2}) - \Phi(x_{\alpha-1/2}) = 0.56331\ldots$. The true value
(again obtainable from tables) is 0.56325.... The error is ridiculously
small, but only because of the accident that in the interval in question
the integral overestimates the sum in (2.14) and the approximation
(2.11) underestimates each term.

(b) Let $p = \tfrac{1}{10}$, $n = 500$, $\alpha = 50$, $\beta = 55$. The correct
value is $P\{50 \le S_n \le 55\} = 0.317573\ldots$. Now
$h = (45)^{-1/2} = 0.1490712\ldots$, and we get the approximation
$\Phi(5.5h) - \Phi(-0.5h) = 0.3235\ldots$. The error is about 2 per cent.
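Examples (a) and (b) can be reproduced without tables. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the exact value is obtained by summing binomial terms directly, and the approximation is the right side of (2.18) evaluated with Python's standard `statistics.NormalDist`.

```python
from math import comb, sqrt
from statistics import NormalDist


def exact_interval(n, p, a, b):
    """P{a <= S_n <= b}, summed term by term as in (2.1)."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q ** (n - k) for k in range(a, b + 1))


def normal_interval(n, p, a, b):
    """Right side of (2.18): Phi(x_{b+1/2}) - Phi(x_{a-1/2})."""
    h = 1 / sqrt(n * p * (1 - p))
    Phi = NormalDist().cdf
    return Phi((b + 0.5 - n * p) * h) - Phi((a - 0.5 - n * p) * h)


# Example (a): p = 1/2, n = 200, interval 95..105
exact_a = exact_interval(200, 0.5, 95, 105)
approx_a = normal_interval(200, 0.5, 95, 105)

# Example (b): p = 1/10, n = 500, interval 50..55
exact_b = exact_interval(500, 0.1, 50, 55)
approx_b = normal_interval(500, 0.1, 50, 55)

print(f"(a) exact {exact_a:.5f}  normal {approx_a:.5f}")
print(f"(b) exact {exact_b:.6f}  normal {approx_b:.4f}")
```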

(c) Let $n = 100$, $p = 0.3$. Table 2 shows in a typical example (for
relatively small $n$) how the normal approximation deteriorates as the
interval $(\alpha, \beta)$ moves away from the central term.
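The deterioration described in example (c) can be recomputed directly. This is a sketch only: the function names are assumptions, the percentage error is taken as (approximation - exact)/exact, and results may differ from a printed table in the last digit because of rounding.

```python
from math import comb, sqrt
from statistics import NormalDist


def exact_interval(n, p, a, b):
    """Exact binomial probability P{a <= S_n <= b}."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q ** (n - k) for k in range(a, b + 1))


def normal_interval(n, p, a, b):
    """Normal approximation (2.18) with the half-unit correction."""
    h = 1 / sqrt(n * p * (1 - p))
    Phi = NormalDist().cdf
    return Phi((b + 0.5 - n * p) * h) - Phi((a - 0.5 - n * p) * h)


def percent_error(n, p, a, b):
    """Signed percentage error of the normal approximation."""
    exact = exact_interval(n, p, a, b)
    return 100 * (normal_interval(n, p, a, b) - exact) / exact


n, p = 100, 0.3
for a in (9, 12, 15, 18, 21, 24, 27):
    b = a + 2
    print(f"{a:2d} <= S_n <= {b:2d}  "
          f"exact {exact_interval(n, p, a, b):.6f}  "
          f"normal {normal_interval(n, p, a, b):.6f}  "
          f"error {percent_error(n, p, a, b):+.0f}%")
```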


TABLE 2

COMPARISON OF THE BINOMIAL DISTRIBUTION FOR n = 100, p = 0.3
AND THE NORMAL APPROXIMATION

Number of                            Normal           Percentage
Successes          Probability       Approximation    Error

 9 ≤ S_n ≤ 11      0.000 006         0.000 03         +400
12 ≤ S_n ≤ 14      0.000 15          0.000 33         +100
15 ≤ S_n ≤ 17      0.002 01          0.002 83          +40
18 ≤ S_n ≤ 20      0.014 30          0.015 99          +12
21 ≤ S_n ≤ 23      0.059 07          0.058 95            0
24 ≤ S_n ≤ 26      0.148 87          0.144 47           -3
27 ≤ S_n ≤ 29      0.237 94          0.234 05           -2
31 ≤ S_n ≤ 33      0.230 13          0.234 05           +2
34 ≤ S_n ≤ 36      0.140 86          0.144 47           +3
37 ≤ S_n ≤ 39      0.058 89          0.058 95            0
40 ≤ S_n ≤ 42      0.017 02          0.015 99           -6
43 ≤ S_n ≤ 45      0.003 43          0.002 83          -18
46 ≤ S_n ≤ 48      0.000 49          0.000 33          -33
49 ≤ S_n ≤ 51      0.000 05          0.000 03          -40

(d) Let us find a number $a$ such that, for large $n$, the inequality
$|S_n^*| > a$ has a probability near $\tfrac12$. For this it is necessary that
$\Phi(a) - \Phi(-a) = \tfrac12$ or $\Phi(a) = \tfrac34$. From tables of the normal
distribution we find that $a = 0.6745$, and hence the two inequalities

(3.1)    $|S_n - np| < 0.6745(npq)^{1/2}$  and  $|S_n - np| > 0.6745(npq)^{1/2}$

are about equally probable. In particular, the probability is about $\tfrac12$
that in $n$ tossings of a coin the number of heads lies within the limits
$n/2 \pm 0.337 n^{1/2}$, and, similarly, that in $n$ throws of a die the number
of aces lies within the interval $n/6 \pm 0.251 n^{1/2}$. The probability of
$S_n$ lying within the limits $np \pm 2(npq)^{1/2}$ is about
$\Phi(2) - \Phi(-2) = 0.9545\ldots$, and for $np \pm 3(npq)^{1/2}$ the
probability is 0.9973....
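The constants in example (d) come from tables of the normal distribution; they can also be obtained from the inverse distribution function. A minimal sketch, assuming Python's standard `statistics.NormalDist` as a stand-in for the tables (the variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

# Root of Phi(a) = 3/4, i.e. Phi(a) - Phi(-a) = 1/2.
a = std_normal.inv_cdf(0.75)

# Half-widths in units of sqrt(n): a * sqrt(pq) for a coin and for a die.
coin_factor = a * sqrt(0.5 * 0.5)
die_factor = a * sqrt((1 / 6) * (5 / 6))

# Probability of staying within 2 and within 3 standard deviations.
within_two = std_normal.cdf(2) - std_normal.cdf(-2)
within_three = std_normal.cdf(3) - std_normal.cdf(-3)

print(f"a = {a:.4f}, coin {coin_factor:.3f}, die {die_factor:.3f}")
print(f"within 2 s.d.: {within_two:.4f}, within 3 s.d.: {within_three:.4f}")
```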

(e) A competition problem. This example illustrates practical
applications of formula (2.21). Two competing railroads operate one train
each between Chicago and Los Angeles; the two trains leave and arrive
simultaneously and have comparable equipment. We suppose that $n$
passengers select trains independently and at random so that the number
of passengers in each train is the outcome of $n$ Bernoulli trials with
$p = \tfrac12$. If a train carries $s < n$ seats, then there is a positive
probability $f(s)$ that more than $s$ passengers will turn up, in which case
not all patrons can be accommodated. Using the approximation (2.21),
we find

(3.2)    $f(s) \approx 1 - \Phi\left(\dfrac{2s - n}{n^{1/2}}\right).$

If $s$ is so large that $f(s) < 0.01$, then the number of seats will be
sufficient in 99 out of 100 cases. More generally, the company may decide
on an arbitrary risk level $\alpha$ and determine $s$ so that
$f(s) < \alpha$. For that purpose it suffices to put

(3.3)    $s \ge \tfrac12 (n + t_\alpha n^{1/2}),$

where $t_\alpha$ is the root of the equation $\alpha = 1 - \Phi(t_\alpha)$,
which can be found from tables. For example, if $n = 1000$ and
$\alpha = 0.01$, then $t_\alpha \approx 2.33$ and $s = 537$ seats should
suffice. If both railroads accept the risk level $\alpha = 0.01$, the two
trains will carry a total of 1074 seats of which 74 will be empty. The loss
from competition (or chance fluctuations) is remarkably small. In the same
way, 514 seats should suffice in about 80 per cent of all cases, and 549
seats in 999 out of 1000 cases.
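Rule (3.3) is easy to evaluate for any risk level. The sketch below is illustrative only: `seats_needed` is a hypothetical helper, and the root t of 1 - \Phi(t) = risk is taken from `statistics.NormalDist.inv_cdf` rather than from tables.

```python
from math import ceil, sqrt
from statistics import NormalDist


def seats_needed(n, risk):
    """Smallest integer s with s >= (n + t * sqrt(n)) / 2, as in (3.3).

    t is the root of 1 - Phi(t) = risk, the accepted probability that
    more than s of the n passengers choose this train (p = 1/2).
    """
    t = NormalDist().inv_cdf(1 - risk)
    return ceil(0.5 * (n + t * sqrt(n)))


n = 1000
for risk in (0.20, 0.01, 0.001):
    print(f"risk {risk}: {seats_needed(n, risk)} seats")
```

With two railroads each using the 0.01 level, the number of surplus seats is `2 * seats_needed(1000, 0.01) - 1000`.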

Similar considerations apply in other competitive supply problems.
For example, if $m$ movies compete for the same $n$ patrons, each movie
will put for its probability of success $p = 1/m$, and (3.3) is to be
replaced by $s \ge (1/m)[n + t_\alpha n^{1/2}(m - 1)^{1/2}]$. The total number
of empty seats under this system is $ms - n = t_\alpha n^{1/2}(m - 1)^{1/2}$.
For $\alpha = 0.01$, $n = 1000$, and $m = 2, 3, 4$, this number is about 74,
126, and 147, respectively. The loss of efficiency because of competition is
again small.

(f) Random digits. In example II(8.b), we considered an event with
$p = 0.3024$. In $n = 1200$ trials this event had an average frequency of
0.3142. The deviation from $p$ is $\epsilon = 0.0118$. In this case
$(pq)^{1/2} = 0.4593$ and $\epsilon (n/pq)^{1/2} \approx 0.890\ldots$. Hence the
probability of $|S_n/n - p| > \epsilon$ is in this case about 0.37.... This
indicates that in about 37 per cent of all cases the average number of
successes should deviate from $p$ by more than it does in our material.

(g) Sampling. A fraction $p$ of a certain population are smokers.
Suppose that $p$ is unknown and that random sampling with replacement
is to be used to determine $p$. It is desired to find $p$ with an error
not exceeding 0.005. How large should the sample size $n$ be? If $p'$ is
the fraction of smokers in the sample, we desire that $|p' - p| < 0.005$.
However, no sample size can give absolute assurance that
$|p' - p| < 0.005$; it is conceivable that the sample contains only smokers.
Since absolute certainty is unattainable, we settle for an arbitrary
confidence level $\alpha$, say, $\alpha = 0.95$, and require that
$|p' - p| < 0.005$ with probability 0.95 or better. Note that $np'$ is the
number of successes in $n$ trials, and hence

$P\{|p' - p| < 0.005\} = P\left\{\left|\dfrac{S_n}{n} - p\right| < 0.005\right\}.$

We seek an $n$ large enough to make this quantity greater than 0.95.
For the present purposes the normal approximation is sufficient. The
root $x$ of $\Phi(x) - \Phi(-x) = 0.95$ is $x = 1.96\ldots$, and hence we should
have $0.005(n/pq)^{1/2} \ge 1.96$. Thus we are led to the inequality
$n \ge 392^2 pq$ or $n \ge 160{,}000\,pq$, approximately. It involves the
unknown $p$, but $pq$ never exceeds $\tfrac14$, and hence the sample size
$n = 40{,}000$ would be safe under all circumstances; with it the odds are
about 20 to 1 that $|p' - p| < 0.005$.
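The sample-size reasoning of example (g) can be written out as a short calculation. This is a sketch under the same normal approximation; `required_sample_size` and `odds_in_favor` are hypothetical names, and the worst case pq = 1/4 is assumed by default since p is unknown.

```python
from math import ceil, sqrt
from statistics import NormalDist

std_normal = NormalDist()


def required_sample_size(max_error, confidence, pq=0.25):
    """Smallest n with max_error * sqrt(n / pq) >= x,
    where x solves Phi(x) - Phi(-x) = confidence."""
    x = std_normal.inv_cdf(0.5 + confidence / 2)
    return ceil((x / max_error) ** 2 * pq)


def odds_in_favor(n, max_error, pq=0.25):
    """Approximate odds that |p' - p| < max_error for sample size n."""
    x = max_error * sqrt(n / pq)
    prob = std_normal.cdf(x) - std_normal.cdf(-x)
    return prob / (1 - prob)


n_min = required_sample_size(0.005, 0.95)
print(n_min)                           # a little below the round figure 40,000
print(odds_in_favor(40_000, 0.005))    # roughly 20 to 1
```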


4. RELATION TO THE POISSON APPROXIMATION


The error of the normal approximation will be small if $npq$ is large.
On the other hand, if $n$ is large and $p$ small, the terms $b(k; n, p)$ will
be found to be near the Poisson probabilities $p(k; \lambda)$ with
$\lambda = np$. If $\lambda$ is small, then only the Poisson approximation can
be used. However, if $\lambda$ is large, we can use either the normal or the
Poisson approximation.

This implies that for large values of $\lambda$ it must be possible to
approximate the Poisson distribution by the normal distribution, and in
example X(1.c) we shall see that this is indeed so (cf. also problem 9).
Here we shall be content to illustrate the point by a numerical and a
practical example.


Examples. (a) Consider the Poisson distribution $p(k; 100)$ as an
approximation, say, to the binomial distribution with $n = 100{,}000{,}000$
and $p = 1/1{,}000{,}000$. Then $npq \approx 100$; this quantity, even though
not large, suffices for the normal distribution to give reasonable
approximations at least for the central sector of the binomial distribution.
The Poisson distribution $p(k; 100)$ agrees with $b(k; 10^8, 10^{-6})$ to many
decimals, and we can compare it with the normal approximation to the
latter. Put, for brevity, $P(a, b) = p(a; 100) + p(a+1; 100) + \cdots +
p(b; 100)$, so that $P(a, b)$ stands for $P\{a \le S_n \le b\}$ and should be
approximated by
$\Phi\left(\dfrac{b - 99.5}{10}\right) - \Phi\left(\dfrac{a - 100.5}{10}\right)$.
The following sample gives an idea of the degree of approximation.

                 Correct Values     Normal Approximation
P(85, 90)        0.113 84           0.110 49
P(90, 95)        0.184 85           0.179 50
P(95, 105)       0.417 63           0.417 68
P(90, 110)       0.706 52           0.706 28
P(110, 115)      0.107 38           0.110 49
P(115, 120)      0.053 23           0.053 35

(b) A telephone trunking problem. The following problem is, with
some simplifications, taken from actual practice.⁹ A telephone exchange
A is to serve 2000 subscribers in a nearby exchange B. It would be too
expensive and extravagant to install 2000 trunklines from A to B. It will
suffice to make the number $N$ of lines so large that, under ordinary
conditions, only one out of every hundred calls will fail to find an idle
trunkline immediately at its disposal. Suppose that during the busy hour
of the day each subscriber requires a trunkline to B for an average of 2
minutes. At a fixed moment of the busy hour we compare the situation to a
set of 2000 trials with a probability $p = \tfrac{1}{30}$ in each that a line
will be required. Under ordinary conditions these trials can be assumed to
be independent (although this is not true when events like unexpected
showers or earthquakes cause many people to call for taxicabs or the local
newspaper; the theory no longer applies, and the trunks will be "jammed").
We have, then, 2000 Bernoulli trials with $p = \tfrac{1}{30}$, and the smallest
number $N$ is required such that the probability of more than $N$
"successes" will be smaller than 0.01; in symbols
$P\{S_{2000} \ge N\} < 0.01$.


9 E. C. Molina, Probability in engineering, Electrical Engineering, vol. 54 (1935),
pp. 423-427, or Bell Telephone System Technical Publications Monograph B-854.
There the problem is treated by the Poisson method given in the text, which is
preferable from the engineer's point of view.

For the Poisson approximation we should take $\lambda = 200/3 \approx 66.67$.
From the tables we find that the probability of 87 or more successes
is about 0.0097, whereas the probability of 86 or more successes is
about 0.013. This would indicate that 87 trunklines should suffice.
For the normal approximation we first find from tables the root $x$
of $1 - \Phi(x) = 0.01$, which is $x = 2.327$. Then it is required that
$(N - \tfrac12 - np)/(npq)^{1/2} \ge 2.327$. Since $n = 2000$, $p = \tfrac{1}{30}$,
this means $N \ge 67.17 + (2.327)(8.027) \approx 85.8$. Hence the normal
approximation would indicate that 86 trunklines should suffice.
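Both solutions of the trunking problem can be recomputed in a few lines. The following sketch uses hypothetical helper names; the Poisson tail is summed directly and the normal bound is the inequality stated in the text with the root taken from `statistics.NormalDist.inv_cdf`.

```python
from math import ceil, exp, lgamma, log, sqrt
from statistics import NormalDist

n, p = 2000, 1 / 30
lam = n * p  # Poisson parameter, about 66.67


def poisson_tail(N, lam):
    """Probability of N or more successes under the Poisson law."""
    below = sum(exp(-lam + k * log(lam) - lgamma(k + 1)) for k in range(N))
    return 1 - below


# Poisson method: smallest N whose tail probability is below 0.01.
N_poisson = next(N for N in range(1, n) if poisson_tail(N, lam) < 0.01)

# Normal method: N >= np + 1/2 + x * sqrt(npq), with 1 - Phi(x) = 0.01.
x = NormalDist().inv_cdf(0.99)
normal_bound = n * p + 0.5 + x * sqrt(n * p * (1 - p))
N_normal = ceil(normal_bound)

print(N_poisson, round(normal_bound, 1), N_normal)
```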

For practical purposes the two solutions agree. They yield further 
practical results. Conceivably, the installation might be cheaper if 
the 2000 subscribers were divided into two groups of 1000 each, and 
two separate groups of trunklines from A to B were installed. Using 
the method above, we find that actually some ten additional trunklines 
would be required so that the first arrangement is preferable. 


5. LARGE DEVIATIONS¹⁰


Frequently we desire an estimate of the probability that the reduced
number of successes $S_n^*$ [cf. (2.19)] exceeds a given number $x$. Hence
the upper limit of the interval is infinity, and it requires a special
argument to show that our limit theorem (2.18) still applies.


Theorem. If $n \to \infty$ and $x$ varies as a function of $n$ in such a way
that $x \to \infty$ but $x^3 h \to 0$, then

(5.1)    $P\{S_n^* > x\} \sim 1 - \Phi(x).$

In view of (1.7) this is equivalent to

(5.2)    $P\{S_n^* > x\} \sim \dfrac{1}{(2\pi)^{1/2}\, x}\, e^{-x^2/2}.$


Proof. Choose in (2.18) the integers $\alpha$ and $\beta$ so that $x$ lies
between $x_\alpha$ and $x_{\alpha+1}$, and that $x_\beta \sim x + \log x$. Then
$x_\beta^3 h \to 0$ and (2.18) holds. Hence

(5.3)    $P\{\alpha \le S_n \le \beta\} \sim \{1 - \Phi(x_\alpha)\} - \{1 - \Phi(x_\beta)\}.$

However, from (1.7) and the fact that $x_\beta \sim x_\alpha + \log x$, it is
readily seen that $1 - \Phi(x_\beta)$ is of smaller order of magnitude than
$1 - \Phi(x_\alpha)$, while $1 - \Phi(x_\alpha) \sim 1 - \Phi(x)$. Hence

(5.4)    $P\{\alpha \le S_n \le \beta\} \sim 1 - \Phi(x).$


10 The theorem is of general interest but will be used in this book only for the
proof of the law of the iterated logarithm, chapter VIII, section 5.

On the other hand, from (2.11) and VI(3.5) we have

(5.5)    $P\{S_n \ge \beta\} \le b(\beta; n, p)\, \dfrac{nh}{x_\beta} \sim \dfrac{nh^2}{x_\beta}\, \phi(x_\beta).$

Now $nh^2 = 1/(pq)$ is a constant, and

(5.6)    $\dfrac{1}{x_\beta}\, \phi(x_\beta) \sim 1 - \Phi(x_\beta).$

We saw that the right side tends to zero faster than $1 - \Phi(x)$, which
means that $P\{S_n \ge \beta\}$ is of smaller order of magnitude than
$1 - \Phi(x)$. Combining this result with (5.4), we see then that

(5.7)    $P\{S_n \ge \alpha\} \sim 1 - \Phi(x),$

and this is our theorem. (Further limit theorems for large deviations
are given in problems 12-17.)

6. PROBLEMS FOR SOLUTION 
1. Generalizing (1.8), prove that

(6.1)    $1 - \Phi(x) \sim \dfrac{1}{(2\pi)^{1/2}}\, e^{-x^2/2} \left\{ \dfrac{1}{x} - \dfrac{1}{x^3} + \dfrac{1 \cdot 3}{x^5} - \dfrac{1 \cdot 3 \cdot 5}{x^7} + - \cdots + (-1)^k \dfrac{1 \cdot 3 \cdots (2k-1)}{x^{2k+1}} \right\}$

and that for $x > 0$ the right side overestimates $1 - \Phi(x)$ if $k$ is even,
and underestimates if $k$ is odd.

2. For every constant $a > 0$

(6.2)    $\dfrac{1 - \Phi(x + a/x)}{1 - \Phi(x)} \to e^{-a}$

as $x \to \infty$.

3. Find the probability that among 10,000 random digits the digit 7 appears
not more than 968 times.

4. Find an approximation to the probability that the number of aces
obtained in 12,000 rollings of a die is between 1900 and 2150.

5. Find a number $k$ such that the probability is about 0.5 that the number
of heads obtained in 1000 tossings of a coin will be between 440 and $k$.

6. A sample is taken in order to find the fraction $f$ of females in a
population. Find a sample size such that the probability of a sampling error
less than 0.005 will be 0.99 or greater.

7. In 10,000 tossings, a coin fell heads 5400 times. Is it reasonable to assume
that the coin is skew?


8. Find an approximation to the maximal term of the trinomial distribution

$\dfrac{n!}{k!\, r!\, (n - k - r)!}\, p_1^k\, p_2^r\, (1 - p_1 - p_2)^{n-k-r}.$


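9. Normal approximation to the Poisson distribution. Using Stirling's
formula, show that, if $\lambda \to \infty$, then for every fixed
$\alpha < \beta$

(6.3)    $\sum_{\lambda + \alpha\lambda^{1/2} < k < \lambda + \beta\lambda^{1/2}} p(k; \lambda) \to \Phi(\beta) - \Phi(\alpha).$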


10. Normal approximation to the hypergeometric distribution. Let $n$, $m$, $k$
be positive integers and suppose that

(6.4)    $\dfrac{r}{n+m} \to t$,  $\dfrac{n}{n+m} \to p$,  $\dfrac{m}{n+m} \to q$,  $h\{k - rp\} \to x$

where $1/h = \{(n + m)pqt(1 - t)\}^{1/2}$. Prove that

(6.5)    $\dbinom{n}{k} \dbinom{m}{r-k} \Big/ \dbinom{n+m}{r} \sim h\phi(x).$

Hint: Use the normal approximation to the binomial distribution rather
than Stirling's formula.

11. Normal distribution and combinatorial runs. In II(11.19) we found that
in an arrangement of $n$ alphas and $m$ betas the probability of having
exactly $k$ runs of alphas is

(6.6)    $\dbinom{n-1}{k-1} \dbinom{m+1}{k} \Big/ \dbinom{n+m}{n}.$

Let $n \to \infty$, $m \to \infty$ so that (6.4) holds. For fixed
$\alpha < \beta$ the probability that the number of alpha runs lies between
$npq + \alpha(pqn)^{1/2}$ and $npq + \beta(pqn)^{1/2}$ tends to
$\Phi(\beta) - \Phi(\alpha)$.¹¹


Note: In the following problems $h^{-2} = npq$ and $S_n^*$ is the reduced number
of successes defined in (2.19). Finally

(6.7)    $F_n(x) = P\{S_n^* > x\}.$


12. If x varies as a function of n so that z°+*h — 0 but x -- οὐ, ΤΕ 12 


11 A. Wald and J. Wolfowitz, On a test whether two samples are from the same
population, Annals of Mathematical Statistics, vol. 11 (1940), pp. 147-162. For
more general results, see A. M. Mood, The distribution theory of runs, ibid., pp.
367-392.

12 N. Smirnov, Über Wahrscheinlichkeiten grosser Abweichungen (in Russian,
German summary), Recueil Mathématique [Sbornik] Moscou, vol. 40 (1933), pp.
443-454.

F_n(x)


(6.8) 1-4 


= 1 + o(z°), 
where o(x*) stands for terms that are of smaller order of magnitude than 2°. 
13. If $x^3 h \to 0$, $x \to \infty$, then¹³ for any constant $a > 0$

(6.9)    $\dfrac{F_n(x) - F_n(x + a/x)}{F_n(x)} \to 1 - e^{-a}.$

In words, the conditional probability of $x < S_n^* \le x + a/x$, given that
$S_n^* > x$, tends to $1 - e^{-a}$. [Hint: Use (5.2).]

14. Probabilities of large deviations. Starting with (2.4), prove the following
theorem. If $n \to \infty$, and $k$ varies so that $(k - np)/n \to 0$, then

(6.10)    $b(k; n, p) \sim \dfrac{h}{(2\pi)^{1/2}}\, e^{-\frac12 x^2 - f(x)}$

where $x = (k - np)h$ and

(6.11)    $f(x) = \sum_{\nu=3}^{\infty} \dfrac{p^{\nu-1} - (-q)^{\nu-1}}{\nu(\nu-1)}\, h^{\nu-2} x^{\nu}.$

Note: If $x^3 h \to 0$, then $f(x) \to 0$, and (6.10) reduces to (2.11). If $x$ is
of the order of magnitude of $h^{-1/3}$ but negligible as compared to
$h^{-1/2}$, then

(6.12)    $f(x) \sim \dfrac{p - q}{6}\, x^3 h.$

If $x$ is of the order of magnitude of $h^{-1/2}$, then

(6.13)    $f(x) \sim \dfrac{p - q}{6}\, x^3 h + \dfrac{p^3 + q^3}{12}\, x^4 h^2,$

etc.

15. Continuation. Prove that if $x \to \infty$, $xh \to 0$,

(6.14)    $f\left(x + \dfrac{a}{x}\right) - f(x) \to 0$

and hence

(6.15)    $F_n(x) \sim e^{-f(x)}\{1 - \Phi(x)\}.$

16. Deduce (6.9) from (6.15), assuming only $xh \to 0$.
17. If p > q, then for large x 


(Hint: Use problem 14.) 


18. A new derivation of the law of large numbers. Show that the law of large
numbers is a consequence of the DeMoivre-Laplace limit theorem.

19. A new derivation of the normal approximation.¹⁴ Starting from VI(10.11)-
(10.13), prove that when $n \to \infty$ and $k \to \infty$ in such a way that
$\xi_k^3 n^{-1/2} \to 0$, we have

(6.17)    $b(k; n, p) \sim b(m; n, p)\, e^{-\frac12 \xi_k^2}.$


13 A. Khintchine, Über einen neuen Grenzwertsatz der Wahrscheinlichkeitsrech-
nung, Mathematische Annalen, vol. 101 (1929), pp. 745-752. See also problem 16.
20. If $np \le m < (n+1)p$, show that

(6.18)    $b\left(m; n, \dfrac{m+1}{n+1}\right) \le b(m; n, p) \le b\left(m; n, \dfrac{m}{n}\right).$

If $(n + 1)p - 1 < m \le np$, the same inequality holds with $(m + 1)/(n + 1)$
in the extreme left member replaced by $m/(n + 1)$.

21. Conclude that $b(m; n, p) \sim \{2\pi(n + 1)pq\}^{-1/2}$ and give upper and
lower bounds.


14 Problems 19 and 20 together imply that
$b(k; n, p) \sim \{(n + 1)pq\}^{-1/2} \phi(\xi_k)$. This is the same as the basic
approximation formula (2.11) with $x_k$ replaced by $\xi_k$ and
$h = \{npq\}^{-1/2}$ replaced by $h' = \{(n + 1)pq\}^{-1/2}$. Since
$x_k \sim \xi_k$ and $h \sim h'$, the two formulas are asymptotically
equivalent. Actually the new formula involves a smaller error term (in its
derivation the error committed in passing from (2.7) to (2.8) is avoided). It
should also be noted that the calculations required for problems 19 and 20 are
simpler and more intuitive than those used in the text; they involve only the
standard estimates for logarithms as used in chapter II, section 8, and
chapter VI, section 3. In short, the new formula and its derivation are
superior to those of the text, but they do not conform to the time-honored use
of $np$ instead of $(n + 1)p$.


CHAPTER VIII* 


Unlimited Sequences 
of Bernoulli Trials 


This chapter discusses certain properties of randomness and the
important law of the iterated logarithm for Bernoulli trials. A different
aspect of the fluctuation theory of Bernoulli trials (at least for
$p = \tfrac12$) is covered in chapter III.


1. INFINITE SEQUENCES OF TRIALS 


In the preceding chapter we have dealt with probabilities connected
with $n$ Bernoulli trials and have studied their asymptotic behavior as
$n \to \infty$. We turn now to a more general type of problem where the
events themselves cannot be defined in a finite sample space.


Example. A problem in runs. Let $\alpha$ and $\beta$ be positive integers, and
consider a potentially unlimited sequence of Bernoulli trials, such as
tossing a coin or throwing dice. Suppose that Paul bets Peter that a
run of $\alpha$ consecutive successes will occur before a run of $\beta$
consecutive failures. It has an intuitive meaning to speak of the event that
Paul wins, but it must be remembered that in the mathematical theory the
term event stands for "aggregate of sample points" and is meaningless
unless an appropriate sample space has been defined. The model of a
finite number of trials is insufficient for our present purpose, but the
difficulty is solved by a simple passage to the limit. In $n$ trials Peter
wins or loses, or the game remains undecided. Let the corresponding
probabilities be $x_n$, $y_n$, $z_n$ ($x_n + y_n + z_n = 1$). As the number $n$
of trials increases, the probability $z_n$ of a tie can only decrease, and both
$x_n$ and $y_n$ necessarily increase. Hence $x = \lim x_n$, $y = \lim y_n$, and
$z = \lim z_n$ exist. Nobody would hesitate to call them the probabilities
of Peter's ultimate gain or loss or of a tie. However, the corresponding
three events are defined only in the sample space of infinite sequences
of trials, and this space is not discrete.


* This chapter is not directly connected with the material covered in subsequent
chapters and may be omitted at first reading.


The example was introduced for illustration only, and the numerical values of
$x_n$, $y_n$, $z_n$ are not our immediate concern. We shall return to their
calculation in example XIII(8.b). The limits $x$, $y$, $z$ may be obtained by a
simpler method which is applicable to more general cases. We indicate it here
because of its importance and intrinsic interest.

Let A be the event that a run of $\alpha$ consecutive successes occurs before a
run of $\beta$ consecutive failures. Then A means Paul's winning and
$x = P\{A\}$. If $u$ and $v$ are the conditional probabilities of A under the
hypotheses, respectively, that the first trial results in success or failure,
then $x = pu + qv$ [see V(1.8)]. Suppose first that the first trial results in
success. In this case the event A can occur in $\alpha$ mutually exclusive ways:
(1) The following $\alpha - 1$ trials result in successes; the probability for
this is $p^{\alpha-1}$. (2) The first failure occurs at the $\nu$th trial where
$2 \le \nu \le \alpha$. Let this event be $H_\nu$. Then
$P\{H_\nu\} = p^{\nu-2}q$, and $P\{A \mid H_\nu\} = v$. Hence (using once more
the formula for compound probabilities)

(1.1)    $u = p^{\alpha-1} + qv(1 + p + \cdots + p^{\alpha-2}) = p^{\alpha-1} + v(1 - p^{\alpha-1}).$

If the first trial results in failure, a similar argument leads to

(1.2)    $v = pu(1 + q + \cdots + q^{\beta-2}) = u(1 - q^{\beta-1}).$

We have thus two equations for the two unknowns $u$ and $v$ and find for
$x = pu + qv$

(1.3)    $x = p^{\alpha-1}\, \dfrac{1 - q^{\beta}}{p^{\alpha-1} + q^{\beta-1} - p^{\alpha-1}q^{\beta-1}}.$

To obtain $y$ we have only to interchange $p$ and $q$, and $\alpha$ and
$\beta$. Thus

(1.4)    $y = q^{\beta-1}\, \dfrac{1 - p^{\alpha}}{p^{\alpha-1} + q^{\beta-1} - p^{\alpha-1}q^{\beta-1}}.$

Since $x + y = 1$, we have $z = 0$; the probability of a tie is zero.

For example, in tossing a coin ($p = \tfrac12$) the probability that a run of
two heads appears before a run of three tails is 0.7; for two consecutive heads
before four consecutive tails the probability is $\tfrac56$, for three
consecutive heads before four consecutive tails $\tfrac{15}{22}$. In rolling
dice there is probability 0.1753 that two consecutive aces will appear before
five consecutive non-aces, etc.
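The numerical examples follow from formula (1.3) by direct substitution. The sketch below evaluates it in exact rational arithmetic; the function name is illustrative, and the formula used is $x = p^{\alpha-1}(1 - q^{\beta}) / (p^{\alpha-1} + q^{\beta-1} - p^{\alpha-1}q^{\beta-1})$.

```python
from fractions import Fraction


def success_run_first(p, alpha, beta):
    """Probability that a run of alpha successes precedes a run of
    beta failures, evaluated from formula (1.3)."""
    q = 1 - p
    numerator = p ** (alpha - 1) * (1 - q**beta)
    denominator = (p ** (alpha - 1) + q ** (beta - 1)
                   - p ** (alpha - 1) * q ** (beta - 1))
    return numerator / denominator


half = Fraction(1, 2)
coin_2_vs_3 = success_run_first(half, 2, 3)
coin_2_vs_4 = success_run_first(half, 2, 4)
coin_3_vs_4 = success_run_first(half, 3, 4)
dice_2_vs_5 = success_run_first(Fraction(1, 6), 2, 5)

print(coin_2_vs_3, coin_2_vs_4, coin_3_vs_4, float(dice_2_vs_5))
```

Interchanging p with q and alpha with beta gives y of (1.4), and the two results add up to one.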


In the present volume we are confined to the theory of discrete
sample spaces, and this means a considerable loss of mathematical
elegance. The general theory considers $n$ Bernoulli trials only as the
beginning of an infinite sequence of trials. A sample point is then
represented by an infinite sequence of letters S and F, and the sample
space is the aggregate of all such sequences. A finite sequence, like
SSFS, stands for the aggregate of all points with this beginning, that
is, for the compound event that in an infinite sequence of trials the first
four result in S, S, F, S, respectively. In the infinite sample space the
game of our example can be interpreted without a limiting process.
Take any point, that is, a sequence SSFSFF .... In it a run of $\alpha$
consecutive S's may or may not occur. If it does, it may or may not
be preceded by a run of $\beta$ consecutive F's. In this way we get a
classification of all sample points into three classes, representing the
events "Peter wins," "Peter loses," "no decision." Their probabilities are
the numbers $x$, $y$, $z$, computed above. The only trouble with this
sample space is that it is not discrete, and we have not yet defined
probabilities in general sample spaces.

Note that we are discussing a question of terminology rather than a
genuine difficulty. In our example there was no question about the
proper definition or interpretation of the number $x$. The trouble is
only that for consistency we must either decide to refer to the number
$x$ as "the limit of the probability $x_n$ that Peter wins in $n$ trials" or
else talk of the event "that Peter wins," which means referring to a
non-discrete sample space. We propose to do both. For simplicity of
language we shall refer to events even when they are defined in the
infinite sample space; for precision, the theorems will also be formulated
in terms of finite sample spaces and passages to the limit. The
events to be studied in this chapter share the following salient feature
of our example. The event "Peter wins," although defined in an
infinite space, is the union of the events "Peter wins at the nth trial"
(n = 1, 2, ...), each of which depends only on a finite number of
trials. The required probability $x$ is the limit of a monotonic sequence
of probabilities $x_n$ which depend only on finitely many trials. We
require no theory going beyond the model of $n$ Bernoulli trials; we merely
take the liberty of simplifying clumsy expressions¹ by calling certain
numbers probabilities instead of using the term "limits of probabilities."


2. SYSTEMS OF GAMBLING 


The painful experience of many gamblers has taught us the lesson 
that no system of betting is successful in improving the gambler’s 
chances. If the theory of probability is true to life, this experience 
must correspond to a provable statement. 

For orientation let us consider a potentially unlimited sequence of
Bernoulli trials and suppose that at each trial the bettor has the free
choice of whether or not to bet. A "system" consists in fixed rules
selecting those trials on which the player is to bet. For example, the
bettor may make up his mind to bet at every seventh trial or to wait
as long as necessary for seven heads to occur between two bets. He
may bet only following a head run of length 13, or bet for the first
time after the first head, for the second time after the first run of two
consecutive heads, and generally, for the kth time, just after k heads
have appeared in succession. In the latter case he would bet less and
less frequently. We need not consider the stakes at the individual
trials; we want to show that no "system" changes the bettor's situation
and that he can achieve the same result by betting every time.
It goes without saying that this statement can be proved only for systems
in the ordinary meaning where the bettor does not know the
future (the existence or non-existence of genuine prescience is not our
concern). It must also be admitted that the rule "go home after losing
three times" does change the situation, but we shall rule out such
uninteresting systems.


1 For the reader familiar with general measure theory the situation may be
described as follows. We consider only events which either depend on a finite
number of trials or are limits of monotonic sequences of such events. We
calculate the obvious limits of probabilities and clearly require no measure
theory for that purpose. However, only general measure theory shows that our
limits are independent of the particular passage to the limit and are
completely additive.

We define a system as a set of fixed rules which for every trial uniquely
determines whether or not the bettor is to bet; at the kth trial the decision
may depend on the outcomes of the first k - 1 trials, but not on the outcome
of trials number k, k+1, k+2, ...; finally the rules must be such as to
ensure an indefinite continuation of the game. Since the set of rules is
fixed, the event "in $n$ trials the bettor bets more than $r$ times" is well
defined and its probability calculable. The last condition requires that
for every $r$, as $n \to \infty$, this probability tends to 1.

We now formulate our fundamental theorem to the effect that under 
any system the successive bets form a sequence of Bernoulli trials with 
unchanged probability for success. With an appropriate change of 
phrasing this theorem holds for all kinds of independent trials; the 
successive bets form in each case an exact replica of the original trials, 
so that no system can affect the bettor’s fortunes. The importance of 
this statement was first recognized by von Mises, who introduced the 
impossibility of a successful gambling system as a fundamental axiom. 
The present formulation and proof follow Doob.² For simplicity we
assume that $p = \tfrac12$.

Let $A_k$ be the event "first bet occurs at the kth trial." Our definition
of system requires that as $n \to \infty$ the probability tends to one
that the first bet has occurred before the nth trial. This means that
$P\{A_1\} + P\{A_2\} + \dots + P\{A_n\} \to 1$, or

(2.1)    $\sum_{k=1}^{\infty} P\{A_k\} = 1.$


2 J. L. Doob, Note on probability, Annals of Mathematics, vol. 37 (1936), pp. 
363-367. 
Next, let $B_k$ be the event "head at kth trial." Then the event $B$
"when first bet is made the trial results in heads" is the union of the
events $A_1B_1, A_2B_2, A_3B_3, \dots$ which are mutually exclusive. Now
$A_k$ depends only on the outcome of the first $k - 1$ trials, and $B_k$
only on the trial number $k$. Hence $A_k$ and $B_k$ are independent and
$P\{A_kB_k\} = P\{A_k\}P\{B_k\} = \tfrac12 P\{A_k\}$. Thus
$P\{B\} = \sum P\{A_kB_k\} = \tfrac12 \sum P\{A_k\} = \tfrac12$. This shows
that under this system the probability of heads at the first bet is
$\tfrac12$, and the same statement holds for all subsequent bets.

It remains to show that the bets are stochastically independent. 
This means that the probability that the coin falls heads at both the 
first and the second bet should be $\tfrac14$ (and similarly for all other
combinations and for the subsequent trials). To verify this statement let
$A_k^*$ be the event that the second bet occurs at the kth trial. Let $E$
represent the event "heads at the first two bets"; it is the union of all
events $A_jB_jA_k^*B_k$ where $j < k$ (if $j \ge k$, then $A_j$ and $A_k^*$
are mutually exclusive and $A_jA_k^* = 0$). Therefore

(2.2)    $P\{E\} = \sum_{j=1}^{\infty} \sum_{k=j+1}^{\infty} P\{A_jB_jA_k^*B_k\}.$


As before, we see that for fixed $j$ and $k > j$, the event $B_k$ (heads at
kth trial) is independent of the event $A_jB_jA_k^*$ (which depends only on
the outcomes of the first $k - 1$ trials). Hence

(2.3)    $P\{E\} = \tfrac12 \sum_{j=1}^{\infty} \sum_{k=j+1}^{\infty} P\{A_jB_jA_k^*\}
                 = \tfrac12 \sum_{j=1}^{\infty} P\{A_jB_j\} \sum_{k=j+1}^{\infty} P\{A_k^* \mid A_jB_j\}$

[cf. V(1.8)]. Now, whenever the first bet occurs and whatever its outcome,
the game is sure to continue, that is, the second bet occurs sooner
or later. This means that for given $A_jB_j$ with $P\{A_jB_j\} > 0$ the
conditional probabilities that the second bet occurs at the kth trial must
add to unity. The second series in (2.3) is therefore unity, and we
have already seen that $\sum P\{A_jB_j\} = \tfrac12$. Hence $P\{E\} = \tfrac14$ as
contended. A similar argument holds for any combination of trials.
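
The first step of the argument can be checked by direct enumeration for a
fair coin. The following Python sketch is only an illustration: the rule
`bet_after_two_heads` and the helper `first_bet_table` are hypothetical names
chosen here, not part of the text. It lists all $2^n$ sequences of $n$
tosses, locates the trial $k$ of the first bet, and compares $P\{A_kB_k\}$
with $\tfrac12 P\{A_k\}$.

```python
from fractions import Fraction
from itertools import product


def bet_after_two_heads(history):
    """Hypothetical system: bet on the next trial iff the two preceding
    tosses were both heads. The decision uses past outcomes only."""
    return len(history) >= 2 and history[-2:] == ("H", "H")


def first_bet_table(system, n):
    """Enumerate all 2**n fair-coin sequences and return, for k = 1..n,
    the pair (P{A_k}, P{A_k B_k}): first bet at trial k, and first bet
    at trial k that also results in heads."""
    first = [0] * (n + 1)
    first_and_head = [0] * (n + 1)
    for seq in product("HT", repeat=n):
        for k in range(1, n + 1):
            if system(seq[: k - 1]):       # decision sees trials 1..k-1 only
                first[k] += 1
                if seq[k - 1] == "H":
                    first_and_head[k] += 1
                break                       # only the first bet is recorded
    total = 2 ** n
    return [
        (Fraction(first[k], total), Fraction(first_and_head[k], total))
        for k in range(1, n + 1)
    ]


table = first_bet_table(bet_after_two_heads, 10)
for k, (p_a, p_ab) in enumerate(table, start=1):
    print(k, p_a, p_ab)
```

For every $k$ the second column is exactly half the first, and the sum of
the first column approaches 1 as $n$ grows, in agreement with (2.1).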


Note that the situation is different when the player is permitted to 
vary arbitrarily the amounts which he puts down. With systems de- 
pending on the accumulated gain, there exist advantageous strategies, 
and the game depends on the strategy. We shall return to this point 
in chapter XIV, section 2. 
3. THE BOREL-CANTELLI LEMMAS


Two simple lemmas concerning infinite sequences of trials are used 
so frequently that they deserve special attention. We formulate them 
for Bernoulli trials, but they apply to more general cases. 

We refer again to an infinite sequence of Bernoulli trials. Let $A_1$,
$A_2, \dots$ be an infinite sequence of events each of which depends only
on a finite number of trials; in other words, we suppose that there
exists an integer $n_k$ such that $A_k$ is an event in the sample space of
the first $n_k$ Bernoulli trials. Put

(3.1)    $a_k = P\{A_k\}.$

(For example, $A_k$ may be the event that the 2kth trial concludes a run
of at least $k$ consecutive successes. Then $n_k = 2k$ and $a_k = p^k$.)

For every infinite sequence of letters S and F it is possible to establish
whether it belongs to 0, 1, 2, ... or infinitely many among the
$\{A_k\}$. This means that we can speak of the event $U_r$ that an unending
sequence of trials produces more than $r$ among the events $\{A_k\}$,
and also of the event $U_\infty$ that infinitely many among the $\{A_k\}$ occur.
The event $U_r$ is defined only in the infinite sample space, and its
probability is the limit of $P\{U_{n,r}\}$, the probability that $n$ trials produce
more than $r$ among the events $\{A_k\}$. Finally, $P\{U_\infty\} = \lim P\{U_r\}$;
this limit exists since $P\{U_r\}$ decreases as $r$ increases.


Lemma 1. If $\sum a_k$ converges, then with probability one only finitely
many events $A_k$ occur. More precisely, it is claimed that for $r$ sufficiently
large, $P\{U_r\} < \epsilon$ or: to every $\epsilon > 0$ it is possible to find an integer $r$ such
that the probability that $n$ trials produce one or more among the events
$A_{r+1}, A_{r+2}, \dots$ is less than $\epsilon$ for all $n$.


Proof. Determine $r$ so that $a_{r+1} + a_{r+2} + \dots < \epsilon$; this is possible
since $\sum a_k$ converges. Without loss of generality we may suppose that
the $A_k$ are ordered in such a way that $n_1 \le n_2 \le n_3 \le \dots$. Let $N$ be
the last subscript for which $n_N \le n$. Then $A_1, \dots, A_N$ are defined in
the space of $n$ trials, and the lemma asserts that the probability that
one or more among the events $A_{r+1}, A_{r+2}, \dots, A_N$ occur is less than $\epsilon$.
This is true, since by the fundamental inequality I(7.6) we have

(3.2)    $P\{A_{r+1} \cup A_{r+2} \cup \dots \cup A_N\} \le a_{r+1} + a_{r+2} + \dots + a_N < \epsilon,$

as contended.
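
The inequality (3.2) can be illustrated on the example following (3.1),
where $A_k$ is the event that the 2kth trial concludes a run of at least $k$
successes. The sketch below assumes $p = \tfrac12$ and uses hypothetical
helper names; it computes the probability of the union exactly by
enumeration and compares it with the sum $a_{r+1} + \dots + a_N$.

```python
from fractions import Fraction
from itertools import product


def run_event(seq, k):
    """A_k: trials k+1, ..., 2k (1-based) are all successes, so the
    2k-th trial concludes a success run of length at least k."""
    return all(seq[i] == 1 for i in range(k, 2 * k))


def union_probability(r, big_n):
    """Exact P{A_{r+1} or ... or A_N} for p = 1/2, enumerating 2N trials."""
    n = 2 * big_n
    hits = sum(
        1
        for seq in product((0, 1), repeat=n)
        if any(run_event(seq, k) for k in range(r + 1, big_n + 1))
    )
    return Fraction(hits, 2 ** n)


def tail_bound(r, big_n):
    """a_{r+1} + ... + a_N with a_k = (1/2)**k, the right side of (3.2)."""
    return sum(Fraction(1, 2 ** k) for k in range(r + 1, big_n + 1))


exact = union_probability(3, 7)
bound = tail_bound(3, 7)
print(exact, "<=", bound)
```

The exact probability is smaller than the bound because the events overlap;
the lemma needs only the bound.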

A satisfactory converse to the lemma is known only for the special
case of mutually independent $A_k$. This situation occurs when the trials
are divided into non-overlapping blocks and $A_k$ depends only on the
trials in the kth block (for example, $A_k$ may be the event that the kth
thousand of trials produces more than 600 successes).


Lemma 2. If the events $A_k$ are mutually independent, and if $\sum a_k$
diverges, then with probability one infinitely many $A_k$ occur. In other
words, it is claimed that for every $r$ the probability that $n$ trials
produce more than $r$ among the events $\{A_k\}$ tends to 1 as $n \to \infty$.


Proof. As in the proof of lemma 1 let $A_1, A_2, \dots, A_N$ be the
events defined in the sample space of $n$ trials. The probability
that none of them occurs is, because of the assumed independence,
$(1 - a_1)(1 - a_2) \cdots (1 - a_N)$. Now $1 - x < e^{-x}$ for $0 < x < 1$, and
hence $(1 - a_1)(1 - a_2) \cdots (1 - a_N) < e^{-(a_1 + a_2 + \dots + a_N)}$. With
increasing $N$ the last quantity tends to zero. We have thus proved that with
probability one at least one among the $\{A_k\}$ occurs.

Next, divide the sequence $\{A_k\}$ into two subsequences $\{A_k^*\}$ and
$\{A_k^{**}\}$ so that both series $\sum P\{A_k^*\}$ and $\sum P\{A_k^{**}\}$ diverge. Applying
our result to these subsequences we find that, with probability one, at
least one $A_k^*$ and one $A_k^{**}$ occur. Therefore there is probability one
that at least two among the $\{A_k\}$ occur. Applying, in turn, this statement
to the sequences $\{A_k^*\}$ and $\{A_k^{**}\}$ we find that at least four
among the $\{A_k\}$ are bound to occur, etc.


Example. What is the probability that in a sequence of Bernoulli
trials the pattern SFS appears infinitely often? Let $A_k$ be the event
that the trials number $k$, $k + 1$, and $k + 2$ produce the sequence SFS.
The events $A_k$ are obviously not mutually independent, but the
sequence $A_1, A_4, A_7, A_{10}, \dots$ contains only mutually independent
events (since no two depend on the outcome of the same trials). Since
$a_k = p^2q$ is independent of $k$, the series $a_1 + a_4 + a_7 + \dots$ diverges,
and hence with probability one the pattern SFS occurs infinitely often.
A similar argument obviously applies for arbitrary patterns. (For
further examples see problems 4 and 5.)
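
The estimate used in the proof of lemma 2 is easily evaluated for this
example. The sketch below, with hypothetical function names, computes the
probability that none of the first $m$ disjoint blocks of three trials shows
SFS, namely $(1 - p^2q)^m$, and the bound $e^{-mp^2q}$ from the proof.

```python
import math


def prob_no_sfs(p, m):
    """Probability that none of the m disjoint blocks (trials 1-3, 4-6, ...)
    shows the pattern SFS; the blocks are independent and each shows the
    pattern with probability p*p*q."""
    q = 1.0 - p
    return (1.0 - p * p * q) ** m


def exp_bound(p, m):
    """Bound exp(-(a_1 + a_4 + ...)) over m blocks, as in the proof of lemma 2."""
    q = 1.0 - p
    return math.exp(-m * p * p * q)


for m in (10, 100, 1000):
    print(m, prob_no_sfs(0.5, m), exp_bound(0.5, m))
```

Both quantities tend to zero as $m$ increases, which is the statement that
at least one of the independent events is certain to occur.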


4. THE STRONG LAW OF LARGE NUMBERS 


The intuitive notion of probability is based on the expectation that 
the following is true: If the number of successes in the first n trials of 
a sequence of Bernoulli trials is $S_n$, then

(4.1)    $\frac{S_n}{n} \to p.$

In the abstract theory this cannot be true for every sequence of trials;
in fact, our sample space contains a point representing the conceptual
possibility of an infinite sequence of uninterrupted successes, and for
it $S_n/n = 1$. However, it is demonstrable that (4.1) holds with probability
one, so that the cases where (4.1) does not hold form a negligible
exception.

Note that we deal with a statement much stronger than the weak 
law of large numbers [VI(4.2)]. The latter says that for every sufficiently
large fixed $n$ the average $S_n/n$ is likely to be near $p$, but it does
not say that $S_n/n$ is bound to stay near $p$ if the number of trials is
increased. It leaves open the possibility that in $n$ additional trials at
least one of the events $S_{n+1}/(n+1) < p - \epsilon$, or $S_{n+2}/(n+2) < p - \epsilon$,
..., or $S_{2n}/2n < p - \epsilon$ occurs; the probability of this is the sum of a
large number of probabilities of which we know only that they are
individually small. We shall now prove that with probability one
$S_n/n - p$ becomes and remains small.


Strong Law of Large Numbers. For every $\epsilon > 0$ we have probability
one that only finitely many of the events

(4.2)    $\left| \frac{S_n}{n} - p \right| > \epsilon$

occur. This implies that (4.1) holds with probability one. In terms
of finite sample spaces, it is asserted that to every $\epsilon > 0$, $\delta > 0$ there
corresponds an $r$ such that for all $\nu$ the probability of the simultaneous
realization of the $\nu$ inequalities

(4.3)    $\left| \frac{S_{r+k}}{r+k} - p \right| < \epsilon, \qquad k = 1, 2, \dots, \nu,$

is greater than $1 - \delta$.
Proof. We shall prove a much stronger statement. Let $A_k$ be the
event

(4.4)    $\left| \frac{S_k - kp}{(kpq)^{1/2}} \right| \ge (2a \log k)^{1/2},$

where $a > 1$. It is then obvious from VII(5.2) that, at least for all $k$
sufficiently large,

(4.5)    $P\{A_k\} < e^{-a \log k} = \frac{1}{k^a}.$

Hence $\sum P\{A_k\}$ converges, and lemma 1 of the preceding section ensures
that with probability one only finitely many inequalities (4.4) hold. On
the other hand, if (4.2) holds, then

(4.6)    $\left| \frac{S_n - np}{(npq)^{1/2}} \right| > \frac{\epsilon\, n^{1/2}}{(pq)^{1/2}},$

and for large $n$ the right side is larger than $(2a \log n)^{1/2}$. Hence, the
realization of infinitely many inequalities (4.2) implies the realization
of infinitely many $A_k$ and has therefore probability zero.
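
The content of the theorem may be illustrated, though of course not proved,
by simulation. The following sketch uses Python's pseudo-random generator
with a fixed seed; the function names and the parameter values are choices
made here for illustration. It records the frequencies $S_k/k$ along one
simulated sequence and reports the largest deviation $|S_k/k - p|$ over all
$k > r$ within the simulated range.

```python
import random


def running_frequencies(p, n, seed=0):
    """Simulate n Bernoulli trials and return the list S_k / k, k = 1..n."""
    rng = random.Random(seed)
    successes = 0
    freqs = []
    for k in range(1, n + 1):
        if rng.random() < p:
            successes += 1
        freqs.append(successes / k)
    return freqs


def max_tail_deviation(freqs, p, r):
    """Largest |S_k/k - p| over all simulated k > r."""
    return max(abs(f - p) for f in freqs[r:])


freqs = running_frequencies(p=0.3, n=20000, seed=1)
for r in (100, 1000, 10000):
    print(r, max_tail_deviation(freqs, 0.3, r))
```

The weak law concerns each frequency separately; the quantity printed here
is the maximum over a whole stretch of trials, which is what the strong law
controls.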

The strong law of large numbers was first formulated by Cantelli 
(1917), after Borel and Hausdorff had discussed certain special cases. 
Like the weak law, it is only a very special case of a general theorem 
on random variables. Taken in conjunction with our theorem on the 
impossibility of gambling systems, the law of large numbers implies 
the existence of the limit (4.1) not only for the original sequence of 
trials but also for all subsequences obtained in accordance with the 
rules of section 2. Thus the two theorems together describe the funda- 
mental properties of randomness which are inherent in the intuitive notion 
of probability and whose importance was stressed with special emphasis 
by von Mises. 


5. THE LAW OF THE ITERATED LOGARITHM 


As in chapter VII let us again introduce the reduced number of successes
in $n$ trials

(5.1)    $S_n^* = \frac{S_n - np}{(npq)^{1/2}}.$

The Laplace limit theorem asserts that $P\{S_n^* > x\} \sim 1 - \Phi(x)$.
Thus, for every particular value of $n$ it is improbable to have a large
$S_n^*$, but it is intuitively clear that in a prolonged sequence of trials
$S_n^*$ will sooner or later take on arbitrarily large values. Moderate
values of $S_n^*$ are most probable, but the maxima will slowly increase.
How fast? In the course of the proof of the strong law of large numbers
we have concluded from (4.5) that with probability one the inequality
$S_n^* < (2a \log n)^{1/2}$ holds for each $a > 1$ and all sufficiently large $n$.
This provides us with an upper bound for the fluctuations of $S_n^*$, but
this bound is bad. To see this, let us apply the same argument to the
subsequence $S_2^*, S_4^*, S_8^*, S_{16}^*, \dots$; that is, let us define the event $A_k$
by $S_{2^k}^* \ge (2a \log k)^{1/2}$. The inequality (4.5) now implies that
$S_{2^k}^* < (2a \log k)^{1/2}$ for $a > 1$ and all sufficiently large $k$. But for
$n = 2^k$ we have $\log k \sim \log \log n$, and we conclude that for each $a > 1$
and all $n$ of the form $n = 2^k$ the inequality

(5.2)    $S_n^* < (2a \log \log n)^{1/2}$

will hold from some $k$ onward. It is now a fair guess that in reality
(5.2) holds for all $n$ sufficiently large and, in fact, this is one part of
the law of the iterated logarithm. This remarkable theorem³ asserts
that $(2 \log \log n)^{1/2}$ is the precise upper bound in the sense that for each
$a < 1$ the reverse of the inequality (5.2) will hold for infinitely many $n$.


Theorem. With probability one we have

(5.3)    $\limsup_{n \to \infty} \frac{S_n^*}{(2 \log \log n)^{1/2}} = 1.$

This means: For $\lambda > 1$ with probability one only finitely many of the
events

(5.4)    $S_n > np + \lambda (2npq \log \log n)^{1/2}$

occur; for $\lambda < 1$ with probability one (5.4) holds for infinitely many $n$.
For reasons of symmetry equation (5.3) implies that

(5.3a)   $\liminf_{n \to \infty} \frac{S_n^*}{(2 \log \log n)^{1/2}} = -1.$


Proof. We start with two preliminary remarks.
(1) There exists a constant $c > 0$ which depends on $p$, but not on $n$,
such that

(5.5)    $P\{S_n > np\} > c$

for all $n$. In fact, an inspection of the binomial distribution shows that
the left side in (5.5) is never zero, and the Laplace limit theorem shows
that it tends to $\tfrac12$ as $n \to \infty$. Accordingly, the left side is bounded
away from zero, as asserted.

(2) We require the following lemma: Let $x$ be fixed, and let $A$ be
the event that for at least one $k$ with $k \le n$

(5.6)    $S_k - kp > x.$

Then

(5.7)    $P\{A\} \le \frac{1}{c} P\{S_n - np > x\}.$


3 A. Khintchine, Über einen Satz der Wahrscheinlichkeitsrechnung, Fundamenta
Mathematicae, vol. 6 (1924), pp. 9-20. The discovery was preceded by partial
results due to other authors. The present proof is arranged so as to permit
straightforward generalization to more general random variables.
For a proof of the lemma let $A_\nu$ be the event that (5.6) holds for
$k = \nu$ but not for $k = 1, 2, \dots, \nu - 1$ (here $1 \le \nu \le n$). The events
$A_1, A_2, \dots, A_n$ are mutually exclusive, and $A$ is their union. Hence

(5.8)    $P\{A\} = P\{A_1\} + \dots + P\{A_n\}.$

Next, for $\nu < n$ let $U_\nu$ be the event that the total number of successes
in the trials number $\nu+1, \nu+2, \dots, n$ exceeds $(n - \nu)p$. If both
$A_\nu$ and $U_\nu$ occur, then $S_n > S_\nu + (n - \nu)p > np + x$, and since the
$A_\nu U_\nu$ are mutually exclusive, this implies

(5.9)    $P\{S_n - np > x\} \ge P\{A_1U_1\} + P\{A_2U_2\} + \dots + P\{A_{n-1}U_{n-1}\} + P\{A_n\}.$

Now $A_\nu$ depends only on the first $\nu$ trials and $U_\nu$ only on the following
$n - \nu$ trials. Hence $A_\nu$ and $U_\nu$ are independent, and
$P\{A_\nu U_\nu\} = P\{A_\nu\}P\{U_\nu\}$. From the preliminary remark (5.5) we know
that $P\{U_\nu\} > c$, and since $c < 1$, we get from (5.9) and (5.8)

(5.10)   $P\{S_n - np > x\} \ge c \sum P\{A_\nu\} = cP\{A\}.$

This proves (5.7).
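
For small $n$ the lemma can be confirmed by enumeration. The sketch below
assumes $p = \tfrac12$ and an integer $x$; the helper names are hypothetical.
In place of the universal constant $c$ of (5.5) it uses the smallest of
the probabilities $P\{S_m > m/2\}$ for $m = 1, \dots, n - 1$, which is all that
the proof requires for a fixed $n$.

```python
from fractions import Fraction
from itertools import product
from math import comb


def prob_above_mean(m):
    """P{S_m > m/2} for m fair trials."""
    return Fraction(sum(comb(m, j) for j in range(m + 1) if 2 * j > m), 2 ** m)


def check_lemma(n, x):
    """Return (P{A}, P{S_n - n/2 > x}, c) for p = 1/2, where A is the event
    that S_k - k/2 > x for at least one k <= n, and c is the smallest of
    P{S_m > m/2} over m = 1..n-1 (a valid choice for this n only)."""
    hit_max = hit_end = 0
    for seq in product((0, 1), repeat=n):
        s, crossed = 0, False
        for k, outcome in enumerate(seq, start=1):
            s += outcome
            if 2 * s - k > 2 * x:           # S_k - k/2 > x, in integers
                crossed = True
        hit_max += crossed
        hit_end += (2 * s - n > 2 * x)
    c = min(prob_above_mean(m) for m in range(1, n))
    total = 2 ** n
    return Fraction(hit_max, total), Fraction(hit_end, total), c


p_a, p_end, c = check_lemma(n=12, x=2)
print(p_a, "<=", p_end / c)
```

The point of the lemma is that the probability of exceeding the level $x$ at
some time up to $n$ is controlled by the probability of exceeding it at time
$n$ alone.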


(3) We now prove the part of the theorem relating to (5.4) with
$\lambda > 1$. Let $\gamma$ be a number such that

(5.11)   $1 < \gamma < \lambda,$

and let $n_r$ be the integer nearest to $\gamma^r$ ($r = 1, 2, \dots$). Let $B_r$ be the
event that the inequality

(5.12)   $S_n - np > \lambda (2 n_r pq \log \log n_r)^{1/2}$

holds for at least one $n$ with $n_r \le n < n_{r+1}$. Obviously (5.4) can hold
for infinitely many $n$ only if infinitely many $B_r$ occur. Using the first
Borel-Cantelli lemma, we see therefore that it suffices to prove that

(5.13)   $\sum P\{B_r\}$ converges.

By the inequality (5.7)

(5.14)   $P\{B_r\} \le c^{-1} P\{S_{n_{r+1}} - n_{r+1}p > \lambda (2 n_r pq \log \log n_r)^{1/2}\}
                  = c^{-1} P\left\{ S_{n_{r+1}}^* > \lambda \left( 2 \frac{n_r}{n_{r+1}} \log \log n_r \right)^{1/2} \right\}.$

Now $n_r/n_{r+1} \sim \gamma^{-1} > \lambda^{-1}$, and hence for sufficiently large $r$

(5.15)   $P\{B_r\} \le c^{-1} P\{S_{n_{r+1}}^* > (2\lambda \log \log n_r)^{1/2}\}.$

From formula VII(5.2) we get, therefore, for large $r$,

(5.16)   $P\{B_r\} \le c^{-1} e^{-\lambda \log \log n_r} = \frac{1}{c (\log n_r)^{\lambda}} \sim \frac{1}{c (r \log \gamma)^{\lambda}}.$

Since $\lambda > 1$, the assertion (5.13) is proved.
(4) Finally, we prove the assertion concerning (5.4) with $\lambda < 1$.
This time we choose for $\gamma$ an integer so large that

(5.17)   $\frac{\gamma - 1}{\gamma} > \eta > \lambda,$

where $\eta$ is a constant to be determined later, and put $n_r = \gamma^r$. The
second Borel-Cantelli lemma applies only to independent events, and
for this reason we introduce

(5.18)   $D_r = S_{n_r} - S_{n_{r-1}};$

$D_r$ is the total number of successes following trial number $n_{r-1}$ and
up to and including trial $n_r$; for it we have the binomial distribution
$b(k; n, p)$ with $n = n_r - n_{r-1}$. Let $A_r$ be the event

(5.19)   $D_r - (n_r - n_{r-1})p > \eta (2pq n_r \log \log n_r)^{1/2}.$

We claim that with probability one infinitely many $A_r$ occur. Since the
various $A_r$ depend on non-overlapping blocks of trials (namely,
$n_{r-1} < n \le n_r$), they are mutually independent, and, according to
the second Borel-Cantelli lemma, it suffices to prove that $\sum P\{A_r\}$
diverges. Now

(5.20)   $P\{A_r\} = P\left\{ \frac{D_r - (n_r - n_{r-1})p}{\{(n_r - n_{r-1})pq\}^{1/2}} > \eta \left( 2 \frac{n_r}{n_r - n_{r-1}} \log \log n_r \right)^{1/2} \right\}.$

Here $n_r/(n_r - n_{r-1}) = \gamma/(\gamma - 1) < \eta^{-1}$, by (5.17). Hence

(5.21)   $P\{A_r\} \ge P\left\{ \frac{D_r - (n_r - n_{r-1})p}{\{(n_r - n_{r-1})pq\}^{1/2}} > (2\eta \log \log n_r)^{1/2} \right\}.$

Using again the estimate (5.2) of chapter VII, we find for large $r$

(5.22)   $P\{A_r\} > \frac{1}{2\eta \log \log n_r}\, e^{-\eta \log \log n_r} = \frac{1}{2\eta (\log \log n_r)(\log n_r)^{\eta}}.$

Since $n_r = \gamma^r$ and $\eta < 1$, we find that for large $r$ we have $P\{A_r\} > 1/r$,
which proves the divergence of $\sum P\{A_r\}$.
The last step of the proof consists in showing that $S_{n_{r-1}}$ in (5.18) can
be neglected. From the first part of the theorem, which has already
been proved, we know that to every $\epsilon > 0$ we can find an $N$ so that,
with probability $1 - \epsilon$ or better, for all $r > N$,

(5.23)   $|S_{n_{r-1}} - n_{r-1}p| < 2 (2pq n_{r-1} \log \log n_{r-1})^{1/2}.$

Now suppose that $\eta$ is chosen so close to 1 that

(5.24)   $1 - \eta < \left( \frac{\eta - \lambda}{2} \right)^2.$

Then from (5.17)

(5.25)   $4 n_{r-1} = 4 n_r \gamma^{-1} < n_r (\eta - \lambda)^2$

and hence (5.23) implies

(5.26)   $S_{n_{r-1}} - n_{r-1}p > -(\eta - \lambda)(2pq n_r \log \log n_r)^{1/2}.$

Adding (5.26) to (5.19), we obtain (5.4) with $n = n_r$. It follows that,
with probability $1 - \epsilon$ or better, this inequality holds for infinitely
many $r$, and this accomplishes the proof.

The law of the iterated logarithm for Bernoulli trials is a special 
case of a more general theorem first formulated by Kolmogorov. At 
present it is possible to formulate stronger theorems (cf. problems 7 
and 8). 
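
A simulation gives some feeling for the normalization in (5.3), although the
iterated logarithm grows so slowly that no finite experiment can exhibit the
limit itself. The sketch below is an illustration only; the function name,
the seed, and the starting index 100 (chosen so that $\log \log k$ is well
defined and not too small) are assumptions made here. It follows one
simulated sequence and records the ratio $S_k^*/(2 \log \log k)^{1/2}$.

```python
import math
import random


def lil_ratio_path(p, n, seed=0, start=100):
    """Return the values S_k^* / (2 log log k)^(1/2) for k = start..n
    along one simulated sequence of Bernoulli trials."""
    rng = random.Random(seed)
    q = 1.0 - p
    successes = 0
    ratios = []
    for k in range(1, n + 1):
        if rng.random() < p:
            successes += 1
        if k >= start:
            reduced = (successes - k * p) / math.sqrt(k * p * q)
            ratios.append(reduced / math.sqrt(2.0 * math.log(math.log(k))))
    return ratios


ratios = lil_ratio_path(p=0.5, n=100000, seed=2)
print("largest ratio:", max(ratios), "smallest ratio:", min(ratios))
```

In typical runs the extreme ratios are of the order of magnitude of 1 in
absolute value, as the theorem leads one to expect; a single path of this
length proves nothing about the upper limit.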


6. INTERPRETATION IN NUMBER THEORY LANGUAGE


Let $x$ be a real number in the interval $0 \le x < 1$, and let

(6.1)    $x = .a_1 a_2 a_3 \dots$

be its decimal expansion (so that each $a_j$ stands for one of the digits
0, 1, ..., 9). This expansion is unique except for numbers of the form
$a/10^n$ (where $a$ is an integer), which can be written either by means of
an expansion containing infinitely many zeros or by means of an expansion
containing infinitely many nines. To avoid ambiguities we
now agree not to use the latter form.

The decimal expansions are connected with Bernoulli trials with
$p = \tfrac{1}{10}$, the digit 0 representing success and all other digits failure.
If we replace in (6.1) all zeros by the letter S and all other digits by F,
then (6.1) represents a possible outcome of an infinite sequence of


4 A. Kolmogoroff, Das Gesetz des iterierten Logarithmus, Mathematische Annalen, 
vol. 101 (1929), pp. 126-135. 
Bernoulli trials with $p = \tfrac{1}{10}$. Conversely, an arbitrary sequence of
letters S and F can be obtained in the described manner from the expansion
of certain numbers $x$. In this way every event in the sample
space of Bernoulli trials is represented by a certain aggregate of numbers
$x$. For example, the event "success at the nth trial" is represented
by all those $x$ whose nth decimal is zero. This is an aggregate
of $10^{n-1}$ intervals each of length $10^{-n}$, and the total length of these
intervals equals $\tfrac{1}{10}$, which is the probability of our event. Every
particular finite sample sequence of length $n$ corresponds to an aggregate
of certain intervals; for example, the sequence SFS is represented
by the nine intervals $0.01 \le x < 0.011$, $0.02 \le x < 0.021$, ...,
$0.09 \le x < 0.091$. The probability of each such sample sequence
equals the total length of the corresponding intervals on the $x$-axis.
Probabilities of more complicated events are always expressed in terms
of probabilities of finite sample sequences, and the calculation proceeds
according to the same addition rule that is valid for the familiar
Lebesgue measure on the $x$-axis. Accordingly, our probabilities will
always coincide with the measure of the corresponding aggregate of
points on the $x$-axis. We have thus a means of translating all limit
theorems for Bernoulli trials with $p = \tfrac{1}{10}$ into theorems concerning
decimal expansions. The phrase "with probability one" is equivalent
to "for almost all $x$" or "almost everywhere."
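
The correspondence between sample sequences and intervals is easily made
explicit. The following sketch (the function name is a hypothetical choice)
lists, for a given pattern of letters S and F, the intervals of numbers $x$
whose first decimals spell that pattern, and adds up their lengths.

```python
from fractions import Fraction
from itertools import product


def intervals_for_pattern(pattern):
    """Return the half-open intervals [lo, hi) of numbers x in [0, 1) whose
    first len(pattern) decimals spell the pattern (S = digit 0, F = 1..9)."""
    n = len(pattern)
    choices = [(0,) if letter == "S" else tuple(range(1, 10)) for letter in pattern]
    width = Fraction(1, 10 ** n)
    result = []
    for digits in product(*choices):
        lo = sum(Fraction(d, 10 ** (i + 1)) for i, d in enumerate(digits))
        result.append((lo, lo + width))
    return result


sfs = intervals_for_pattern("SFS")
total_length = sum(hi - lo for lo, hi in sfs)
print(len(sfs), total_length)
```

For the pattern SFS this yields the nine intervals named in the text, of
total length $9/1000 = \tfrac{1}{10} \cdot \tfrac{9}{10} \cdot \tfrac{1}{10}$.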

We have considered the random variable $S_n$ which gives the number
of successes in $n$ trials. Here it is more convenient to emphasize the
fact that $S_n$ is a function of the sample point, and we write $S_n(x)$ for
the number of zeros among the first $n$ decimals of $x$. Obviously the graph
of $S_n(x)$ is a step polygon whose discontinuities are necessarily points
of the form $a/10^n$, where $a$ is an integer. The ratio $S_n(x)/n$ is called
the frequency of zeros among the first $n$ decimals of $x$.

In the language of ordinary measure theory the weak law of large
numbers asserts that $S_n(x)/n \to \tfrac{1}{10}$ in measure, whereas the strong
law states that $S_n(x)/n \to \tfrac{1}{10}$ almost everywhere. Khintchine’s law
of the iterated logarithm shows that

(6.2)    $\limsup_{n \to \infty} \frac{S_n(x) - n/10}{(n \log \log n)^{1/2}} = (0.3)\, 2^{1/2}$

for almost all $x$. It gives an answer to a problem treated in a series
of papers initiated by Hausdorff⁵ (1913) and Hardy and Littlewood⁶

5 F. Hausdorff, Grundzüge der Mengenlehre, Leipzig, 1913.
6 Hardy and Littlewood, Some problems of Diophantine approximation, Acta
Mathematica, vol. 37 (1914), pp. 155-239.
(1914). For a further improvement of this result see problems 7 and 8.

Instead of the digit zero we may consider any other digit and can
formulate the strong law of large numbers to the effect that the frequency
of each of the ten digits tends to $\tfrac{1}{10}$ for almost all $x$. A similar
theorem holds if the base 10 of the decimal system is replaced by any
other base. This fact was discovered by Borel (1909) and is usually
expressed by saying that almost all numbers are "normal."


7. PROBLEMS FOR SOLUTION 


1. Find an integer $\beta$ such that in rolling dice there are about even chances
that a run of three consecutive aces appears before a non-ace run of length $\beta$.

2. Consider repeated independent trials with three possible outcomes A, B,
C and corresponding probabilities $p, q, r$ ($p + q + r = 1$). Find the probability
that a run of $\alpha$ consecutive A’s will occur before a B-run of length $\beta$.

3. Continuation. Find the probability that an A-run of length $\alpha$ will occur
before either a B-run of length $\beta$ or a C-run of length $\gamma$.

4. In a sequence of Bernoulli trials let $A_n$ be the event that a run of $n$
consecutive successes occurs between the $2^n$th and the $2^{n+1}$st trial. If
$p \ge \tfrac12$, there is probability one that infinitely many $A_n$ occur; if
$p < \tfrac12$, then with probability one only finitely many $A_n$ occur.

5.⁷ Denote by $N_n$ the length of the success run beginning at the nth trial
(i.e., $N_n = 0$ if the nth trial results in F, etc.). Prove that with probability
one

(7.1)    $\limsup_{n \to \infty} \frac{N_n}{\operatorname{Log} n} = 1,$

where Log denotes the logarithm to the basis $1/p$.

Hint: Consider the event $A_n$ that the nth trial is followed by a run of more
than $a \operatorname{Log} n$ successes. For $a > 1$ the calculation is straightforward. For
$a < 1$ consider the subsequence of trials number $a_1, a_2, \dots$ where $a_n$ is an
integer very close to $n \operatorname{Log} n$.

6. From the law of the iterated logarithm conclude: With probability one
it will happen for infinitely many $n$ that all $S_k^*$ with $n < k < 17n$ are positive.
(Note: Considerably stronger statements can be proved using the results of
chapter III.)

7. Let $\phi(t)$ be a positive monotonically increasing function, and let $n_r$ be
the nearest integer to $e^{r/\log r}$. If

(7.2)    $\sum_r \frac{1}{\phi(n_r)}\, e^{-\frac12 \phi^2(n_r)}$

converges, then with probability one, the inequality

(7.3)    $S_n > np + (npq)^{1/2} \phi(n)$

takes place only for finitely many $n$. Note that without loss of generality we

7 Suggested by a communication from D. J. Newman. 
may suppose that $\phi(n) < 10(\log \log n)^{1/2}$; the law of the iterated logarithm
takes care of the larger $\phi(n)$.


8. Prove⁸ that the series (7.2) converges if, and only if,

(7.4)    $\sum_n \frac{\phi(n)}{n}\, e^{-\frac12 \phi^2(n)}$

converges. (Hint: Collect the terms for which $n_{r-1} < n < n_r$ and note that
$n_r - n_{r-1} \sim n_r(1 - 1/\log r)$; furthermore, (7.4) can converge only if
$\phi^2(n) > 2 \log \log n$.)


8 Problems 7 and 8 together show that in case of convergence of (7.4) the inequality
(7.3) holds with probability one only for finitely many $n$. Conversely, if (7.4)
diverges, the inequality (7.3) holds with probability one for infinitely many $n$.
This converse is much more difficult to prove; cf. W. Feller, The general form of the
so-called law of the iterated logarithm, Transactions of the American Mathematical
Society, vol. 54 (1943), pp. 373-402, where more general theorems are proved for
arbitrary random variables. For the special case of Bernoulli trials with $p = \tfrac12$
cf. P. Erdős, On the law of the iterated logarithm, Annals of Mathematics (2), vol.
43 (1942), pp. 419-436. The law of the iterated logarithm follows from the particular
case $\phi(t) = \lambda (2 \log \log t)^{1/2}$.


CHAPTER IX 


Random Variables; Expectation 


1. RANDOM VARIABLES 


According to the definition given in calculus textbooks, the quantity 
$y$ is called a function of the real number $x$ if to every $x$ there corresponds
a value $y$. This definition can be extended to cases where the independent
variable is not a real number. Thus we call the distance a
function of a pair of points; the perimeter of a triangle is a function
defined on the set of triangles; a sequence $\{a_n\}$ is a function defined for
all positive integers; the binomial coefficient $\binom{x}{k}$ is a function defined
for pairs of numbers $(x, k)$ of which the second is a non-negative integer.
In the same sense we can say that the number $S_n$ of successes in
$n$ Bernoulli trials is a function defined on the sample space; to each of
the $2^n$ points in this space there corresponds a number $S_n$.

A function defined on a sample space is called a random variable.
Throughout the preceding chapters we have been concerned with random
variables without using this term. Typical random variables are
the number of aces in a hand at bridge, of multiple birthdays in a
company of $n$ people, of success runs in $n$ Bernoulli trials. In each
case there is a unique rule which associates a number X with any 
sample point. The classical theory of probability was devoted mainly 
to a study of the gambler’s gain, which is again a random variable; in 
fact, every random variable can be interpreted as the gain of a real or 
imaginary gambler in a suitable game. The position of a particle under 
diffusion, the energy, temperature, etc., of physical systems are random 
variables; but they are defined in non-discrete sample spaces, and their 
study is therefore deferred. In the case of a discrete sample space we 
can actually tabulate any random variable X by enumerating in some 
order all points of the space and associating with each the corresponding 
value of X. 
The term random variable is somewhat confusing; random function
would be more appropriate (the independent variable being a point in 
sample space, that is, outcome of an experiment). 

Let $X$ be a random variable and let $x_1, x_2, \dots$ be the values which
it assumes;¹ in most of what follows the $x_j$ will be integers. The
aggregate of all sample points on which $X$ assumes the fixed value $x_j$
forms the event that $X = x_j$; its probability is denoted by $P\{X = x_j\}$.
The function

(1.1)    $P\{X = x_j\} = f(x_j) \qquad (j = 1, 2, \dots)$

is called the (probability) distribution² of the random variable $X$. Clearly

(1.2)    $f(x_j) \ge 0, \qquad \sum f(x_j) = 1.$

With this terminology we can say that in Bernoulli trials the number
of successes $S_n$ is a random variable with probability distribution
$\{b(k; n, p)\}$, whereas the number of trials up to and including the first
success is a random variable with the distribution $\{q^{k-1}p\}$.
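
The definition can be made concrete by tabulating a random variable on a
small sample space. The following sketch (function names are hypothetical,
and the values $n = 4$, $p = \tfrac13$ are chosen only for illustration)
builds the sample space of $n$ Bernoulli trials, treats $S_n$ as a function
on it, and collects the probabilities $P\{S_n = k\}$.

```python
from fractions import Fraction
from itertools import product
from math import comb


def distribution(sample_space, variable):
    """Tabulate P{X = x} for a function `variable` on a finite sample space
    given as a dict {sample point: probability}."""
    dist = {}
    for point, prob in sample_space.items():
        value = variable(point)
        dist[value] = dist.get(value, 0) + prob
    return dist


def bernoulli_space(n, p):
    """Sample space of n Bernoulli trials: tuples of 0/1 with product weights."""
    q = 1 - p
    return {
        seq: p ** sum(seq) * q ** (n - sum(seq))
        for seq in product((0, 1), repeat=n)
    }


p = Fraction(1, 3)
space = bernoulli_space(4, p)
dist_sn = distribution(space, sum)       # X = S_n, the number of successes
binomial = {k: comb(4, k) * p ** k * (1 - p) ** (4 - k) for k in range(5)}
print(dist_sn == binomial)
```

The tabulated distribution coincides with $\{b(k; 4, \tfrac13)\}$, and its
terms add to unity as required by (1.2).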

Consider now two random variables $X$ and $Y$ defined on the same
sample space, and denote the values which they assume, respectively,
by $x_1, x_2, \dots$, and $y_1, y_2, \dots$; let the corresponding probability
distributions be $\{f(x_j)\}$ and $\{g(y_k)\}$. The aggregate of points in which
the two conditions $X = x_j$ and $Y = y_k$ are satisfied forms an event
whose probability will be denoted by $P\{X = x_j, Y = y_k\}$. The function

(1.3)    $P\{X = x_j, Y = y_k\} = p(x_j, y_k) \qquad (j, k = 1, 2, \dots)$

is called the joint probability distribution of $X$ and $Y$. It is best exhibited


1 In the standard mathematical terminology the set of points $x_1, x_2, \dots$ should
be called the range of $X$. Unfortunately the statistical literature uses the term range
for the difference between the maximum and the minimum of $X$.

2 For a discrete variable $X$ the probability distribution is the function $f(x_j)$
defined on the aggregate of values $x_j$ assumed by $X$. This term must be distinguished
from the term "distribution function," which applies to non-decreasing functions
which tend to 0 as $x \to -\infty$ and to 1 as $x \to \infty$. The distribution function $F(x)$
of $X$ is defined by

         $F(x) = P\{X \le x\} = \sum_{x_j \le x} f(x_j),$

the last sum extending over all those $x_j$ which do not exceed $x$. Thus the
distribution function of a variable can be calculated from its probability distribution and
vice versa. In this volume we shall not be concerned with distribution functions
in general.
in the form of a double-entry table as exemplified in tables 1 and 2.
Clearly

(1.4)    $p(x_j, y_k) \ge 0, \qquad \sum_{j,k} p(x_j, y_k) = 1.$

Moreover, for every fixed $j$

(1.5)    $p(x_j, y_1) + p(x_j, y_2) + p(x_j, y_3) + \dots = P\{X = x_j\} = f(x_j)$

and for every fixed $k$

(1.6)    $p(x_1, y_k) + p(x_2, y_k) + p(x_3, y_k) + \dots = P\{Y = y_k\} = g(y_k).$


In other words, by adding the probabilities in individual rows and 
columns, we obtain the probability distributions of X and Y. They 
may be exhibited as shown in tables 1 and 2 and are then called mar- 
ginal distributions. The adjective “marginal” refers to the outer ap- 
pearance in the double-entry table and is also used for stylistic clarity 
when the joint distribution of two variables and also their individual 
(marginal) distributions appear in the same context. Strictly speak- 
ing, the adjective “marginal” is redundant. 

The notion of joint distribution carries over to systems of more than 
two random variables. 


Examples. (a) Random placements of 3 balls into 3 cells. We refer
to the sample space of 27 points defined formally in table 1 accompanying
example I(2.a); to each point we attach probability $\tfrac{1}{27}$. Let
$N$ denote the number of occupied cells, and for $i = 1, 2, 3$ let $X_i$ denote
the number of balls in the cell number $i$. These are picturesque descriptions.
Formally $N$ is the function assuming the value 1 on the
sample points number 1-3; the value 2 on the points number 4-21;
and the value 3 on the points number 22-27. Accordingly, the probability
distribution of $N$ is defined by $P\{N = 1\} = \tfrac19$, $P\{N = 2\} = \tfrac23$,
$P\{N = 3\} = \tfrac29$. The joint distributions of $(N, X_1)$ and of $(X_1, X_2)$
are given in tables 1 and 2.
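
The joint distribution of $(N, X_1)$ can be obtained mechanically by running
through the 27 sample points. The sketch below is an illustration with
hypothetical function names; it enumerates the points in an order of its
own, which need not agree with the numbering of table 1 of chapter I, and
this has no effect on the resulting distribution.

```python
from fractions import Fraction
from itertools import product

# Each sample point lists the cell (1, 2 or 3) receiving ball 1, 2, 3.
points = list(product((1, 2, 3), repeat=3))
weight = Fraction(1, len(points))                  # 1/27 per point


def occupied(point):
    """N: the number of occupied cells."""
    return len(set(point))


def in_first_cell(point):
    """X_1: the number of balls in cell number 1."""
    return point.count(1)


joint = {}
for pt in points:
    key = (occupied(pt), in_first_cell(pt))
    joint[key] = joint.get(key, 0) + weight


def expectation(g):
    """E[g(N, X_1)] computed from the joint distribution."""
    return sum(prob * g(n, x) for (n, x), prob in joint.items())


e_n = expectation(lambda n, x: n)
e_x = expectation(lambda n, x: x)
cov = expectation(lambda n, x: n * x) - e_n * e_x
print(joint)
print(e_n, e_x, cov)
```

Adding the entries with a fixed first coordinate gives the distribution of
$N$; adding those with a fixed second coordinate gives that of $X_1$.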

(b) Dice. In $n$ throws of an ideal die let $X_1, X_2, X_3$, respectively, denote
the number of ones, twos, and threes. The probability $p(k_1, k_2, k_3)$
that the $n$ throws result in $k_1$ ones, $k_2$ twos, $k_3$ threes, and
$n - k_1 - k_2 - k_3$ other faces is given by the multinomial distribution VI(9.2)
with $p_1 = p_2 = p_3 = \tfrac16$, $p_4 = \tfrac12$, that is, by

(1.7)    $p(k_1, k_2, k_3) = \frac{n!}{k_1!\, k_2!\, k_3!\, (n - k_1 - k_2 - k_3)!} \left(\tfrac16\right)^{k_1+k_2+k_3} \left(\tfrac12\right)^{n-k_1-k_2-k_3}.$

This is the joint distribution of $X_1, X_2, X_3$. Keeping $k_1, k_2$ fixed and
TABLE 1

JOINT DISTRIBUTION OF $(N, X_1)$ IN EXAMPLE (a)

                                 X_1
                        0      1      2      3    || Distribution of N

              1        2q      0      0      q    ||       3q
     N        2        6q     6q     6q      0    ||      18q
              3         0     6q      0      0    ||       6q

Distribution of X_1    8q    12q     6q      q

$E(N) = \tfrac{19}{9}$, $E(N^2) = \tfrac{129}{27}$, $\mathrm{Var}(N) = \tfrac{26}{81}$
$E(X_1) = 1$, $E(X_1^2) = \tfrac53$, $\mathrm{Var}(X_1) = \tfrac23$
$E(NX_1) = \tfrac{19}{9}$, $\mathrm{Cov}(N, X_1) = 0$.

$N$ is the number of occupied cells, $X_1$ the number of balls in the first cell
when 3 balls are distributed randomly in 3 cells. For abbreviation $q = \tfrac{1}{27}$.


TABLE 2

JOINT DISTRIBUTION OF $(X_1, X_2)$ IN EXAMPLE (a)

                                 X_1
                        0      1      2      3    || Distribution of X_2

              0         q     3q     3q      q    ||       8q
     X_2      1        3q     6q     3q      0    ||      12q
              2        3q     3q      0      0    ||       6q
              3         q      0      0      0    ||        q

Distribution of X_1    8q    12q     6q      q

$E(X_i) = 1$, $E(X_i^2) = \tfrac53$, $\mathrm{Var}(X_i) = \tfrac23$
$E(X_1X_2) = \tfrac23$, $\mathrm{Cov}(X_1, X_2) = -\tfrac13$.

$X_i$ is the number of balls in the ith cell when 3 balls are distributed randomly
in 3 cells. For abbreviation $q = \tfrac{1}{27}$.
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summing (1.7) over the possible values kg = 0, 1, ..., n—ki—ke, we 
get, using the binomial theorem, 


a »ἀρὼπΞ---- ---- ροατὴρο 

1" Ae ἰκοῖ(η — key — ee)! 

This is the joint distribution of (X,, X2), which now appears as mar- 
ginal distribution for the triple distribution of Xi, X2, X3. Needless 
to say that (1.8) could have been obtained directly from the multi- 
nomial distribution. Summing (1.8) once more over all k. = 0, 1, 

,n— k, we obtain the distribution of X,, namely the binomial dis: 
tribution with p = ἐξ. 

(c) Sampling. Let a population of n elements be divided into three
classes of respective sizes $n_1 = np_1$, $n_2 = np_2$, and
$n_3 = np_3$ (where $p_1 + p_2 + p_3 = 1$). Suppose that a random
sample of size r is drawn, and denote by $X_1$ and $X_2$ the numbers
of representatives of the first and second class in the sample. If the
sample is with replacement, $P\{X_1 = k_1, X_2 = k_2\}$ is given by the
multinomial distribution

(1.9)    $f(k_1, k_2) = \frac{r!}{k_1!\,k_2!\,(r-k_1-k_2)!}\, p_1^{k_1} p_2^{k_2} p_3^{\,r-k_1-k_2}.$

[See formula VI(9.2).] The variable $X_j$ has the binomial distribution
$\{b(k; r, p_j)\}$. If the sampling is without replacement, then
$P\{X_1 = k_1, X_2 = k_2\}$ is given by the double hypergeometric
distribution II(6.5) and $X_1$ has the simple hypergeometric
distribution II(6.1).

(d) Randomized sampling. Consider once more the preceding example but
suppose that the sample size r, instead of being fixed in advance,
depends on the outcome of a random experiment. More precisely,
suppose that the size of the sample depends on a Poisson distribution:
The probability that the sample size is r is
$p(r; \lambda) = e^{-\lambda}\lambda^r/r!$ and, given the sample size
r, the (conditional) probability that $X_1 = k_1$ and $X_2 = k_2$ is
$f(k_1, k_2)$ of (1.9). For the joint probability distribution of
$(X_1, X_2)$ we have then

(1.10)   $P\{X_1 = k_1, X_2 = k_2\} = e^{-\lambda} \sum_{r=k_1+k_2}^{\infty} \lambda^r f(k_1, k_2)/r!$
         $= e^{-\lambda}\, \frac{(\lambda p_1)^{k_1} (\lambda p_2)^{k_2}}{k_1!\,k_2!} \sum_{k_3=0}^{\infty} \frac{(\lambda p_3)^{k_3}}{k_3!} = e^{-\lambda(1-p_3)}\, \frac{(\lambda p_1)^{k_1} (\lambda p_2)^{k_2}}{k_1!\,k_2!}$

or

(1.11)   $P\{X_1 = k_1, X_2 = k_2\} = p(k_1; \lambda p_1)\, p(k_2; \lambda p_2).$

Summing over $k_2$ we find that $X_1$ has the Poisson distribution
$p(k; \lambda p_1)$. (Problem VI, 27 paraphrases the same statement.)
The joint distribution of $(X_1, X_2)$ takes on the form of a
multiplication table of the two marginal distributions
$\{p(k; \lambda p_1)\}$ and $\{p(k; \lambda p_2)\}$. We shall express
this by saying that $X_1$ and $X_2$ are independent.
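
The factorization in example (d) can be checked numerically. The sketch below sums the series over the sample size r directly, truncating it at a hypothetical cutoff `r_max`, and compares the result with the product of two Poisson probabilities. The function names and the parameter values are illustrative assumptions, not part of the text.

```python
from math import exp, factorial

def poisson(k, lam):
    """Poisson probability p(k; lam) = e^{-lam} lam^k / k!."""
    return exp(-lam) * lam**k / factorial(k)

def multinomial_f(k1, k2, r, p1, p2):
    """Multinomial probability for a sample of fixed size r drawn with
    replacement: k1 of class 1, k2 of class 2, the rest of class 3."""
    k3 = r - k1 - k2
    p3 = 1.0 - p1 - p2
    coeff = factorial(r) // (factorial(k1) * factorial(k2) * factorial(k3))
    return coeff * p1**k1 * p2**k2 * p3**k3

def joint_randomized(k1, k2, lam, p1, p2, r_max=100):
    """Sum over sample sizes r >= k1 + k2 of P{size = r} * P{k1, k2 | r}.
    The infinite series is cut off at r_max (an assumption of this sketch)."""
    return sum(poisson(r, lam) * multinomial_f(k1, k2, r, p1, p2)
               for r in range(k1 + k2, r_max + 1))

lam, p1, p2 = 4.0, 0.2, 0.3
lhs = joint_randomized(1, 2, lam, p1, p2)
rhs = poisson(1, lam * p1) * poisson(2, lam * p2)
print(lhs, rhs)   # the two values agree up to rounding and truncation error
```

For a moderate Poisson parameter the neglected tail beyond `r_max` is far below floating-point precision, so the comparison is meaningful.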


With the notation (1.3) the conditional probability of the event
$Y = y_k$, given that $X = x_j$ (with $f(x_j) > 0$), becomes

(1.12)   $P\{Y = y_k \mid X = x_j\} = \frac{p(x_j, y_k)}{f(x_j)}.$

It is convenient to abbreviate (1.12) to $P\{Y = y_k \mid X\}$; this
defines the (conditional) distribution of Y for given X. A glance at
tables 1 and 2 shows that the conditional probability (1.12) is in
general different from $g(y_k)$. This indicates that inference can be
drawn from the values of X to those of Y and vice versa; the two
variables are (stochastically) dependent. The strongest degree of
dependence exists when Y is a function of X, that is, when the value of
X uniquely determines Y. For example, if a coin is tossed n times and X
and Y are the numbers of heads and tails, then Y = n - X. Similarly,
when $Y = X^2$, we can compute Y from X. In the joint distribution this
means that in each row all entries but one are zero. If, on the other
hand, $p(x_j, y_k) = f(x_j)g(y_k)$ for all combinations of $x_j$,
$y_k$, then the events $X = x_j$ and $Y = y_k$ are independent; the
joint distribution assumes the form of a multiplication table. In this
case we speak of independent random variables. They occur in particular
in connection with independent trials; for example, the numbers scored
in two throws of a die are independent. An example of a different
nature is found in example (d).

Note that the joint distribution of X and Y determines the distribu- 
tions of X and Y, but that we cannot calculate the joint distribution 
of X and Y from their marginal distributions. If two variables X and 
Y have the same distribution, they may or may not be independent. 
For example, the two variables $X_1$ and $X_2$ in table 2 have the same
distribution and are dependent. 

All our notions apply also to the case of more than two variables. 
We recapitulate in the formal 


Definition. A random variable X is a function defined on a given
sample space, that is, an assignment of a real number to each sample
point. The probability distribution of X is the function defined in
(1.1). If two random variables X and Y are defined on the same sample
space, their joint distribution is given by (1.3) and assigns
probabilities to all combinations $(x_j, y_k)$ of values assumed by X
and Y. This notion carries over, in an obvious manner, to any finite
set of variables X, Y, ..., W defined on the same sample space. These
variables are called mutually independent if, for any combination of
values $(x, y, \ldots, w)$ assumed by them,

(1.13)   $P\{X = x, Y = y, \ldots, W = w\} = P\{X = x\}\, P\{Y = y\} \cdots P\{W = w\}.$


In chapter V, section 4, we have defined the sample space corre- 
sponding to n mutually independent trials. Comparing this definition 
to (1.13), we see that if $X_k$ depends only on the outcome of the kth trial,
then the variables $X_1, \ldots, X_n$ are mutually independent. More generally,
if a random variable U depends only on the outcomes of the first k 
trials, and another variable V depends only on the outcomes of the 
last n—k trials, then U and V are independent (cf. problem 39). 

We may conceive of a random variable as a labeling of the points 
of the sample space. This procedure is familiar from dice, where the 
faces are numbered, and we speak of numbers as the possible outcomes 
of individual trials. In conventional mathematical terminology we 
could say that a random variable X is a mapping of the original sample 
space onto a new space whose points are $x_1, x_2, \ldots$. Therefore:

Whenever $\{f(x_j)\}$ satisfies the obvious conditions (1.2) it is
legitimate to talk of a random variable X, assuming the values
$x_1, x_2, \ldots$ with probabilities $f(x_1), f(x_2), \ldots$ without
further reference to the old sample space; a new one is formed by the
sample points $x_1, x_2, \ldots$. Specifying a probability distribution
is equivalent to specifying a sample space whose points are real
numbers. Speaking of two independent random variables X and Y with
distributions $\{f(x_j)\}$ and $\{g(y_k)\}$ is equivalent to referring
to a sample space whose points are pairs of numbers $(x_j, y_k)$ to
which probabilities are assigned by the rule
$P\{(x_j, y_k)\} = f(x_j)g(y_k)$. Similarly, for the sample space
corresponding to a set of n random variables (X, Y, ..., W) we can take
an aggregate of points $(x, y, \ldots, w)$ in the n-dimensional space
to which probabilities are assigned by the joint distribution. The
variables are mutually independent if their joint distribution is given
by (1.13).


Example. (e) Bernoulli trials with variable probabilities. Consider
n independent trials, each of which has only two possible outcomes,
S and F. The probability of S at the kth trial is $p_k$, that of F is
$q_k = 1 - p_k$. If $p_k = p$, this scheme reduces to Bernoulli trials.
The simplest way of describing it is to attribute the values 1 and 0 to
S and F. The model is then completely described by saying that we
have n mutually independent random variables $X_k$ with distributions
$P\{X_k = 1\} = p_k$, $P\{X_k = 0\} = q_k$. This scheme is known under
the confusing name of “Poisson trials.” [See examples (5.b) and
XI(6.b).]


It is clear that the same distribution can occur in conjunction with 
different sample spaces. If we say that the random variable X assumes 
the values 0 and 1 with probabilities $\frac{1}{2}$, then we refer
tacitly to a sample space consisting of the two points 0 and 1.
However, the variable X might have been defined by stipulating that it
equals 0 or 1 according as the tenth tossing of a coin produces heads
or tails; in this case X is defined in a sample space of sequences
(HHT...), and this sample space has $2^{10}$ points.

In principle, it is possible to restrict the theory of probability to 
sample spaces defined in terms of probability distributions of random 
variables. This procedure avoids references to abstract sample spaces 
and also to terms like “trials” and “outcomes of experiments.” The
reduction of probability theory to random variables is a short cut to 
the use of analysis and simplifies the theory in many ways. However, 
it also has the drawback of obscuring the probability background. The 
notion of random variable easily remains vague as “something that 
takes on different values with different probabilities.” But random 
variables are ordinary functions, and this notion is by no means peculiar 
to probability theory. 


Example. (f) Let X be a random variable with possible values
$x_1, x_2, \ldots$ and corresponding probabilities
$f(x_1), f(x_2), \ldots$. If it helps the reader's imagination, he may
always construct a conceptual experiment leading to X. For example,
subdivide a roulette wheel into arcs $l_1, l_2, \ldots$ whose lengths
are as $f(x_1) : f(x_2) : \ldots$. Imagine a gambler receiving the
(positive or negative) amount $x_j$ if the roulette comes to rest at a
point of $l_j$. Then X is the gambler's gain. In n trials, the gains
are assumed to be n independent variables with the common distribution
$\{f(x_j)\}$. To obtain two variables with a given joint distribution
$\{p(x_j, y_k)\}$ let an arc correspond to each combination
$(x_j, y_k)$ and think of two gamblers receiving the amounts $x_j$ and
$y_k$, respectively.


If X, Y, Z, ... are random variables defined on the same sample 
space, then any function F(X, Y, Z, ...) is again a random variable. 
Its distribution can be obtained from the joint distribution of X, Y, 
Z, ... simply by collecting the terms which correspond to combinations 
of (X, Y, Z, ...) giving the same value of F(X, Y, Z, ...). 


Example. (g) In the example illustrated by table 2 the sum
$X_1 + X_2$ is a random variable assuming the values 0, 1, 2, 3 with
probabilities q, 6q, 12q, 8q (where $q = \frac{1}{27}$). The product
$X_1 X_2$ assumes the values 0, 1, 2 with probabilities 15q, 6q, 6q.
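
The rule of collecting terms can be sketched in a few lines. The joint probabilities of table 2 are generated here from the multinomial count 3!/(j! k! (3-j-k)!), which is an assumption equivalent to enumerating the 27 placements; the helper name `distribution_of` is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from math import factorial

q = Fraction(1, 27)

def p(j, k):
    """Joint probability P{X_1 = j, X_2 = k} for 3 balls in 3 cells."""
    if j + k > 3:
        return Fraction(0)
    ways = factorial(3) // (factorial(j) * factorial(k) * factorial(3 - j - k))
    return ways * q

def distribution_of(func):
    """Distribution of func(X_1, X_2): add up the joint probabilities of
    all combinations (j, k) that give the same value of func."""
    dist = defaultdict(Fraction)
    for j in range(4):
        for k in range(4):
            if p(j, k) > 0:
                dist[func(j, k)] += p(j, k)
    return dict(dist)

sum_dist = distribution_of(lambda j, k: j + k)
prod_dist = distribution_of(lambda j, k: j * k)
print(sum_dist)
print(prod_dist)
```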


2. EXPECTATIONS 


To achieve reasonable simplicity it is often necessary to describe 
probability distributions rather summarily by a few “typical values.”
An example is provided by the median which was used above in
connection with waiting times. The median $x_m$ of the distribution
(1.1) is that value assumed by X for which
$P\{X < x_m\} \le \frac{1}{2}$ and also $P\{X > x_m\} \le \frac{1}{2}$.
In other words, $x_m$ is chosen so that the probabilities of X
exceeding or falling short of $x_m$ are as close to $\frac{1}{2}$ as
possible.

However, among the typical values the expectation or mean is by
far the most important. It lends itself best to analytical
manipulations, and it is preferred by statisticians because of a
property known as sampling stability. Its definition follows the
customary notion of an average. If in a certain population $n_k$
families have exactly k children, the total number of families is
$n = n_0 + n_1 + n_2 + \cdots$ and the total number of children
$m = n_1 + 2n_2 + 3n_3 + \cdots$. The average
number of children per family is m/n. The analogy between proba- 
bilities and frequencies suggests the following 


Definition. Let X be a random variable assuming the values
$x_1, x_2, \ldots$ with corresponding probabilities
$f(x_1), f(x_2), \ldots$. The mean or expected value of X is defined by

(2.1)    $E(X) = \sum x_k f(x_k)$

provided that the series converges absolutely. In this case we say that
X has a finite expectation. If $\sum |x_k| f(x_k)$ diverges, then we
say that X has no finite expectation.


It goes without saying that the most common random variables have 
finite expectations; otherwise the concept would be impractical. How- 
ever, variables without finite expectations occur in connection with 
important recurrence problems in physics. The terms mean, average, 
and mathematical expectation are synonymous. We also speak of the 
mean of a distribution instead of referring to a corresponding random 
variable. The notation E(X) is generally accepted in mathematics 
and statistics. In physics $\bar{X}$, $\langle X \rangle$, $\langle X \rangle_{av}$ are common substitutes
for E(X).

We wish to calculate expectations of functions such as $X^2$. This
function is a new random variable assuming the values $x_k^2$; in
general, the probability of $X^2 = x_k^2$ is not $f(x_k)$ but
$f(x_k) + f(-x_k)$ and $E(X^2)$ is defined as the sum of
$x_k^2\{f(x_k) + f(-x_k)\}$ for all k such that $x_k > 0$. Obviously

(2.2)    $E(X^2) = \sum x_k^2 f(x_k)$


provided the series converges. The same procedure of collecting terms 
leads to the general 


Theorem 1. Any function $\phi(x)$ defines a new random variable
$\phi(X)$. If $\phi(X)$ has finite expectation, then

(2.3)    $E(\phi(X)) = \sum \phi(x_k) f(x_k);$

the series converges absolutely if, and only if, $E(\phi(X))$ exists.
For any constant a we have E(aX) = aE(X).


If several random variables $X_1, \ldots, X_n$ are defined on the same
sample space, then their sum $X_1 + \cdots + X_n$ is a new random
variable. Its possible values and the corresponding probabilities can
be readily found from the joint distribution of the $X_k$ and thus
$E(X_1 + \cdots + X_n)$ can be calculated. A simpler procedure is
furnished by the following important


Theorem 2. If $X_1, X_2, \ldots, X_n$ are random variables with
expectations, then the expectation of their sum exists and is the sum
of their expectations:

(2.4)    $E(X_1 + \cdots + X_n) = E(X_1) + \cdots + E(X_n).$


Proof. It suffices to prove (2.4) for two variables X and Y. Using
the notation (1.3), we can write

(2.5)    $E(X) + E(Y) = \sum_{j,k} x_j\, p(x_j, y_k) + \sum_{j,k} y_k\, p(x_j, y_k),$

the summation extending over all possible values $x_j$, $y_k$ (which
need not be all different). The two series converge; their sum can
therefore be rearranged to give $\sum_{j,k} (x_j + y_k)\, p(x_j, y_k)$,
which is by definition the expectation of X + Y. This accomplishes the
proof.


Clearly, no corresponding general theorem holds for products; for
example, $E(X^2)$ is generally different from $(E(X))^2$. Thus, if X is
the number scored with a balanced die, $E(X) = \frac{7}{2}$, but
$E(X^2) = (1 + 4 + 9 + 16 + 25 + 36)/6 = \frac{91}{6}$. However, the
simple multiplication rule holds for mutually independent variables.


Theorem 3. If X and Y are mutually independent random variables
with finite expectations, then their product is a random variable with
finite expectation and

(2.6)    $E(XY) = E(X)E(Y).$


Proof. To calculate E(XY) we should multiply each possible value
$x_j y_k$ with the corresponding probability. We have already remarked
that the values $x_j$ in the definition (2.1) need not be different.
Hence

(2.7)    $E(XY) = \sum_{j,k} x_j y_k f(x_j) g(y_k) = \Big\{\sum_j x_j f(x_j)\Big\} \Big\{\sum_k y_k g(y_k)\Big\},$

the rearrangement being justified since the series converge absolutely.
This proves the theorem. By induction the same multiplication rule
holds for any number of mutually independent random variables.
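
The addition and multiplication rules can be illustrated with two throws of a balanced die. This is a sketch with hypothetical names (`f`, `joint`, `expectation`); the joint distribution of the two throws is taken to be the multiplication table of the marginals, which is exactly the independence assumption.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
f = {x: Fraction(1, 6) for x in faces}       # distribution of one throw

def expectation(dist, g=lambda x: x):
    """E(g(X)) = sum of g(x_k) f(x_k) over the values of X."""
    return sum(g(x) * pr for x, pr in dist.items())

EX = expectation(f)                          # mean of one throw
EX2 = expectation(f, lambda x: x * x)        # second moment of one throw

# Two independent throws: joint probabilities are products of marginals.
joint = {(x, y): f[x] * f[y] for x, y in product(faces, faces)}
E_sum = sum((x + y) * pr for (x, y), pr in joint.items())
E_prod = sum(x * y * pr for (x, y), pr in joint.items())

print(EX, EX2, E_sum, E_prod)
```

The product rule holds for the two independent throws, while E(X^2) differs from (E(X))^2 because X is not independent of itself.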


It is convenient to have a notation also for the expectation of a
conditional probability distribution. If X and Y are two random
variables with the joint distribution (1.3), the conditional
expectation $E(Y \mid X)$ of Y for given X is the function

(2.8)    $\sum_k y_k P\{Y = y_k \mid X = x_j\} = \frac{\sum_k y_k\, p(x_j, y_k)}{f(x_j)}$

provided the series converges absolutely and $f(x_j) > 0$ for all j.


3. EXAMPLES AND APPLICATIONS 


(a) Binomial distribution. Let $S_n$ be the number of successes in n
Bernoulli trials with probability p for success. We know that $S_n$ has
the binomial distribution $\{b(k; n, p)\}$, whence
$E(S_n) = \sum k\, b(k; n, p) = np \sum b(k-1; n-1, p)$. The last sum
includes all terms of the binomial distribution for n - 1 and hence
equals 1. Therefore the mean of the binomial distribution is

(3.1)    $E(S_n) = np.$

The same result could have been obtained without calculation by a
method which is often expedient. Let $X_k$ be the number of successes
scored at the kth trial. This random variable assumes only the values 0
and 1 with corresponding probabilities q and p. Hence
$E(X_k) = 0 \cdot q + 1 \cdot p = p$, and since

(3.2)    $S_n = X_1 + X_2 + \cdots + X_n$

we get (3.1) directly from (2.4).
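
Both routes to the binomial mean can be compared in exact arithmetic. The values n = 12 and p = 1/3 below are arbitrary illustrative choices, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def b(k, n, p):
    """Binomial probability b(k; n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_mean(n, p):
    """Direct evaluation of the sum of k * b(k; n, p)."""
    return sum(k * b(k, n, p) for k in range(n + 1))

n, p = 12, Fraction(1, 3)
mean_direct = binomial_mean(n, p)

# Indicator method: each X_k takes the values 0 and 1 with probabilities
# q = 1 - p and p, so E(X_k) = 0*q + 1*p, and the n means are added.
mean_indicators = sum(0 * (1 - p) + 1 * p for _ in range(n))

print(mean_direct, mean_indicators)
```

The second computation never touches the binomial probabilities, which is the point of the indicator method.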
(b) Poisson distribution. If X has the Poisson distribution
$p(k; \lambda) = e^{-\lambda}\lambda^k/k!$ (where k = 0, 1, ...) then

         $E(X) = \sum k\, p(k; \lambda) = \lambda \sum p(k-1; \lambda).$

The last series contains all terms of the distribution and therefore
adds to unity. Accordingly, the Poisson distribution
$\{e^{-\lambda}\lambda^k/k!\}$ has the mean $\lambda$.

(c) Negative binomial distribution. Let X be a variable with the
geometric distribution $P\{X = k\} = q^k p$ where k = 0, 1, 2, ....
Then $E(X) = qp(1 + 2q + 3q^2 + \cdots)$. On the right we have the
derivative of a geometric series so that $E(X) = qp(1 - q)^{-2} = q/p$.
We have seen in chapter VI, section 8, that X may be interpreted as the
number of failures preceding the first success in a sequence of
Bernoulli trials. More generally, we have studied the sample space
corresponding to Bernoulli trials which are continued until the nth
success. For $r \le n$, let $X_1 = X$, and let $X_r$ be the number of
failures following the (r-1)st success and preceding the rth success.
Then each $X_\nu$ has the geometric distribution $\{q^k p\}$, and
$E(X_\nu) = q/p$. The sum $Y_r = X_1 + \cdots + X_r$ is the number of
failures preceding the rth success. In other words, $Y_r$ is a random
variable whose distribution is the negative binomial defined by either
of the two equivalent formulas VI(8.1) or VI(8.2). It follows that the
mean of this negative binomial is rq/p. This can be verified by direct
computation. From VI(8.2) it is clear that
$k f(k; r, p) = r p^{-1} q\, f(k-1; r+1, p)$, and the terms of the
distribution $\{f(k-1; r+1, p)\}$ add to unity. This direct calculation
has the advantage that it applies also to non-integral r. On the other
hand, the first argument leads to the result without requiring
knowledge of the explicit form of the distribution of
$X_1 + \cdots + X_r$.

(d) Waiting times in sampling. A population of N distinct elements 
is sampled with replacement. Because of repetitions a random sample 
of size r will in general contain fewer than r distinct elements. As the 
sample size increases, new elements will enter the sample more and 
more rarely. We are interested in the sample size $S_r$ necessary for
the acquisition of r distinct elements. (As a special case, consider
the population of N = 365 possible birthdays; here $S_r$ represents the
number of people sampled up to the moment where the sample contains r
different birthdays. A similar interpretation is possible with random
placements of balls into cells. Our problem is of particular interest
to collectors of coupons and other items where the acquisition can be
compared to random sampling.³)


³ G. Pólya, Eine Wahrscheinlichkeitsaufgabe zur Kundenwerbung,
Zeitschrift für Angewandte Mathematik und Mechanik, vol. 10 (1930),
pp. 96-97. Pólya treats a slightly more general problem with different
methods. There exists a huge literature treating variants of the coupon
collector's problem. [Cf. problems 24, 25, XI, 12-14, and II(11.12).]

The first element enters the sample at the first drawing. The number
of drawings from the second up to and including the drawing at which a
new element enters the sample is a random variable $X_1$; generally,
let $X_r$ be the number of drawings following the selection of the rth
element up to and including the selection of the next new element.
Then $S_r = 1 + X_1 + \cdots + X_{r-1}$ is the sample size at the
moment that the rth element enters the sample. Once the sample contains
k different elements the probability of drawing a new one is at each
drawing p = (N - k)/N. The number, $X_k$, of drawings up to and
including the drawing of a new element equals one plus the number of
failures preceding the first success in Bernoulli trials with
p = (N - k)/N. Therefore $E(X_k) = 1 + q/p = N/(N - k)$ and, from the
addition theorem (2.4),

(3.3)    $E(S_r) = N \left\{ \frac{1}{N} + \frac{1}{N-1} + \frac{1}{N-2} + \cdots + \frac{1}{N-r+1} \right\}.$


For r = N we get the expected number of drawings necessary to
exhaust the entire population. For N = 10 we have
$E(S_{10}) = 29.29\ldots$, and $E(S_5) = 6.46\ldots$. This means that
we can expect to cover half the population in about six to seven
drawings, whereas the second half requires some 23 more drawings. A
reasonable approximation to (3.3) for large N is

(3.4)    $E(S_r) \approx N \log \frac{N + \frac{1}{2}}{N - r + \frac{1}{2}}.$
In particular, for any fraction a < 1 the expected number of drawings 
required to obtain a sample containing about the fraction a of the entire 
population is, for large N, approximately N log [1/(1 — a)]; the expected 
number of drawings necessary to have all N elements included in the sam- 
ple is, approximately, N log N. Note that our results are again ob- 
tained without use of the distribution. 
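
The numerical values quoted for N = 10 follow from adding the means N/(N - k) for k = 0, 1, ..., r - 1. The sketch below does this in exact arithmetic and also compares a larger case with the logarithmic approximation N log[1/(1 - a)] stated above; the function names and the choice N = 1000, a = 1/2 are illustrative assumptions.

```python
from fractions import Fraction
from math import log

def expected_sample_size(N, r):
    """E(S_r): expected number of drawings (with replacement) from N
    elements needed to obtain r distinct ones, as a sum of N/(N - k)."""
    return sum(Fraction(N, N - k) for k in range(r))

def log_approximation(N, a):
    """Large-N approximation N log[1/(1 - a)] for covering the fraction a."""
    return N * log(1 / (1 - a))

full = expected_sample_size(10, 10)      # exhaust a population of 10
half = expected_sample_size(10, 5)       # cover half of it
print(float(full), float(half), float(full - half))

exact_half_1000 = float(expected_sample_size(1000, 500))
print(exact_half_1000, log_approximation(1000, 0.5))
```

The first line of output reproduces the asymmetry described in the text: the second half of the population costs far more drawings than the first.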

(e) An estimation problem. A bowl contains balls numbered 1 to N.
Let X be the largest number drawn in n drawings when random sampling
with replacement is used. The event $X \le k$ means that each of n
numbers drawn is less than or equal to k and therefore
$P\{X \le k\} = (k/N)^n$. Hence the probability distribution of X is
given by

(3.5)    $p_k = P\{X = k\} = P\{X \le k\} - P\{X \le k-1\} = \{k^n - (k-1)^n\} N^{-n}.$

It follows that

(3.6)    $E(X) = \sum_{k=1}^{N} k p_k = N^{-n} \sum_{k=1}^{N} \{k^{n+1} - (k-1)^{n+1} - (k-1)^n\} = N^{-n} \Big\{ N^{n+1} - \sum_{k=1}^{N} (k-1)^n \Big\}.$

For large N the last sum is approximately the area under the curve
$y = x^n$ from x = 0 to x = N, that is, $N^{n+1}/(n+1)$. It follows
that for large N

(3.7)    $E(X) \approx \frac{n}{n+1}\, N.$
If a town has N = 1000 cars and a sample of n = 10 is observed, the 
expected number of the highest observed license plate (assuming ran- 
domness) is about 910. (The median is 934.) The practical statistician 
uses the observed maximum in a sample to estimate the unknown true 
number N. This method was used during the last war to estimate 
enemy production (cf. problems 8-11). 
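
The license plate figures can be reproduced exactly. The sketch assumes the distribution P{X = k} = {k^n - (k-1)^n}/N^n of the sample maximum and uses integer arithmetic throughout; the function names are hypothetical.

```python
from fractions import Fraction

def expected_max(N, n):
    """Exact E(X) for the largest of n numbers drawn with replacement
    from 1..N, using P{X = k} = (k^n - (k-1)^n) / N^n."""
    numerator = sum(k * (k**n - (k - 1)**n) for k in range(1, N + 1))
    return Fraction(numerator, N**n)

def median_max(N, n):
    """Smallest k with P{X <= k} = (k/N)^n >= 1/2, tested with integers."""
    return next(k for k in range(1, N + 1) if 2 * k**n >= N**n)

N, n = 1000, 10
exact = expected_max(N, n)
approx = Fraction(n, n + 1) * N          # large-N approximation n N / (n + 1)
print(float(exact), float(approx), median_max(N, n))
```

The exact mean and the approximation differ by about one half, which is negligible at this scale.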

(f) Banach's match box problem. In chapter VI, section 8, we found
the distribution

(3.8)    $u_r = \binom{2N-r}{N}\, 2^{-2N+r}$

for the number X of matches left at the moment when the first box is
found empty. We are unable to calculate the expectation $E(X) = \mu$
in a direct way, but the following indirect way is applicable in many
similar cases. Using the fact that the $u_r$ add to unity (which is not
easily verified), we find

(3.9)    $N - \mu = \sum_{r=0}^{N} (N - r) u_r = \sum_{r=0}^{N-1} (N - r) \binom{2N-r}{N} \frac{1}{2^{2N-r}}.$


By a simple operation on the binomial coefficients the last sum is
transformed into

(3.10)   $\sum_{r=0}^{N-1} \frac{2N-r}{2} \binom{2N-r-1}{N} \frac{1}{2^{2N-r-1}} = \frac{2N+1}{2} \sum_{r=0}^{N-1} u_{r+1} - \frac{1}{2} \sum_{r=0}^{N-1} (r+1) u_{r+1}.$

The last sum is identical with the sum defining $\mu = E(X)$. In the
first sum all $u_r$ except $u_0$ occur, and hence the terms add to
$1 - u_0$. Thus from (3.9) and (3.10)

(3.11)   $N - \mu = \frac{2N+1}{2} (1 - u_0) - \frac{\mu}{2}$

or

(3.12)   $\mu = (2N+1) u_0 - 1 = \frac{2N+1}{2^{2N}} \binom{2N}{N} - 1.$

Using Stirling's formula, we find

(3.13)   $\mu \approx 2 (N/\pi)^{1/2} - 1.$

In particular, in the distribution of chapter VI, table 8, we had
N = 50. For it $\mu = 7.04\ldots$ and the median is 6.
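
The indirect argument can be checked against a direct summation. The sketch assumes that the probabilities of chapter VI, section 8, have the form u_r = C(2N - r, N) / 2^(2N - r) for r = 0, ..., N, and that the resulting closed form is (2N + 1) u_0 - 1; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb, pi, sqrt

def u(r, N):
    """Assumed form of the match box probabilities u_r."""
    return Fraction(comb(2 * N - r, N), 2 ** (2 * N - r))

def mean_direct(N):
    """E(X) computed term by term as the sum of r * u_r."""
    return sum(r * u(r, N) for r in range(N + 1))

def mean_closed_form(N):
    """E(X) from the indirect argument: (2N + 1) u_0 - 1."""
    return (2 * N + 1) * u(0, N) - 1

def mean_stirling(N):
    """Stirling approximation 2 sqrt(N / pi) - 1."""
    return 2 * sqrt(N / pi) - 1

N = 50
total = sum(u(r, N) for r in range(N + 1))
print(total, float(mean_direct(N)), float(mean_closed_form(N)), mean_stirling(N))
```

With exact fractions the direct sum and the closed form agree identically, and the u_r are seen to add to unity.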


4. THE VARIANCE 


Let X be a random variable with distribution $\{f(x_j)\}$, and let
r > 0 be an integer. If the expectation of the random variable $X^r$,
that is,

(4.1)    $E(X^r) = \sum x_j^r f(x_j),$

exists, then it is called the rth moment of X about the origin. If the
series does not converge absolutely, we say that the rth moment does
not exist. Since $|X|^{r-1} \le |X|^r + 1$, it follows that whenever
the rth moment exists so does the (r-1)st, and hence all preceding
moments.

Moments play an important role in the general theory, but in the 
present volume we shall use only the second moment. If it exists, so 
does the mean 


(4.2)    $\mu = E(X).$

It is then natural to introduce instead of the random variable its
deviation from the mean, $X - \mu$. Since
$(x - \mu)^2 \le 2(x^2 + \mu^2)$ we see that the second moment of
$X - \mu$ exists whenever $E(X^2)$ exists. We find

(4.3)    $E((X - \mu)^2) = \sum_j (x_j^2 - 2\mu x_j + \mu^2) f(x_j).$

Splitting the right side into three individual sums, we find it equal
to $E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - \mu^2$.


Definition. Let X be a random variable with second moment $E(X^2)$
and let $\mu = E(X)$ be its mean. We define a number called the
variance of X by

(4.4)    $Var(X) = E((X - \mu)^2) = E(X^2) - \mu^2.$

Its positive square root (or zero) is called the standard deviation of
X.

For simplicity we often speak of the variance of a distribution
without mentioning the random variable. “Dispersion” is a synonym for
the now generally accepted term “variance.”


Examples. (a) If X assumes the values $\pm c$, each with probability
$\frac{1}{2}$, then $Var(X) = c^2$.

(b) If X is the number of points scored with a symmetric die, then
$Var(X) = \frac{1}{6}(1^2 + 2^2 + \cdots + 6^2) - (\frac{7}{2})^2 = \frac{35}{12}$.

(c) For the Poisson distribution $p(k; \lambda)$ the mean is $\lambda$
[cf. example (3.b)] and hence the variance
$\sum k^2 p(k; \lambda) - \lambda^2 = \lambda \sum k\, p(k-1; \lambda) - \lambda^2 = \lambda \sum (k-1)\, p(k-1; \lambda) + \lambda \sum p(k-1; \lambda) - \lambda^2 = \lambda^2 + \lambda - \lambda^2 = \lambda$.
In this case mean and variance are equal.

(d) For the binomial distribution [cf. example (3.a)] a similar
computation shows that the variance is

         $\sum k^2 b(k; n, p) - (np)^2 = np \sum k\, b(k-1; n-1, p) - (np)^2 = np\{(n-1)p + 1\} - (np)^2 = npq.$
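
The variances of examples (a), (b), and (d) can be recomputed from the definition E(X^2) - (E(X))^2. The concrete values c = 5, n = 10, p = 1/4 are arbitrary illustrative choices, and the helper `variance` is a hypothetical name.

```python
from fractions import Fraction
from math import comb

def variance(dist):
    """Var(X) = E(X^2) - (E(X))^2 for a finite distribution {value: prob}."""
    mean = sum(x * pr for x, pr in dist.items())
    second_moment = sum(x * x * pr for x, pr in dist.items())
    return second_moment - mean ** 2

c = 5
two_point = {-c: Fraction(1, 2), c: Fraction(1, 2)}        # example (a)
die = {x: Fraction(1, 6) for x in range(1, 7)}             # example (b)
n, p = 10, Fraction(1, 4)
binom = {k: comb(n, k) * p**k * (1 - p)**(n - k)           # example (d)
         for k in range(n + 1)}

print(variance(two_point), variance(die), variance(binom))
```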


The usefulness of the notion of variance will appear only gradually, 
in particular, in connection with limit theorems of chapter X. Here 
we observe that the variance is a rough measure of spread. In fact, if 
$Var(X) = \sum (x_j - \mu)^2 f(x_j)$ is small, then each term in the
sum is small. A value $x_j$ for which $|x_j - \mu|$ is large must
therefore have a small probability $f(x_j)$. In other words, in case of
small variance large deviations of X from the mean $\mu$ are
improbable. Conversely, a large variance indicates that not all values
assumed by X lie near the mean.


Some readers may be helped by the following interpretation in mechanics.
Suppose that a unit mass is distributed on the x-axis so that the mass
$f(x_j)$ is concentrated at the point $x_j$. Then the mean $\mu$ is the
abscissa of the center of gravity,
and the variance is the moment of inertia. Clearly different mass distributions may 
have the same center of gravity and the same moment of inertia, but it is well 
known that the most important mechanical properties can be described in terms 
of these two quantities. 


If X represents a measurable quantity like length or temperature, 
then its numerical values depend on the origin and the unit of measure- 
ment. A change of the latter means passing from X to a new variable 
aX + b, where a and b are constants. Clearly Var(X + b) = Var(X),
and hence

(4.5)    $Var(aX + b) = a^2\, Var(X).$


The choice of the origin and unit of measurement is to a large degree 
arbitrary, and often it is most convenient to take the mean as origin 
and the standard deviation as unit. We have done so in chapter VII,
when we introduced the normalized number of successes
$S_n^* = (S_n - np)/(npq)^{1/2}$. In general, if X has mean $\mu$ and
variance $\sigma^2$ ($\sigma > 0$), then $X - \mu$ has mean zero and
variance $\sigma^2$, and hence the variable

(4.6)    $X^* = \frac{X - \mu}{\sigma}$

has mean 0 and variance 1. It is called the normalized variable
corresponding to X. In the physicist's language, the passage from X to
$X^*$ would be interpreted as the introduction of dimensionless
quantities.


5. COVARIANCE; VARIANCE OF A SUM


Let X and Y be two random variables on the same sample space.
Then X + Y and XY are again random variables, and their distributions
can be obtained by a simple rearrangement of the joint distribution of
X and Y. Our aim now is to calculate Var(X + Y). For that purpose we
introduce the notion of covariance, which will be analyzed in greater
detail in section 8. If the joint distribution of X and Y is
$\{p(x_j, y_k)\}$, then the expectation of XY is given by

(5.1)    $E(XY) = \sum x_j y_k\, p(x_j, y_k),$

provided, of course, that the series converges absolutely. Now
$|x_j y_k| \le (x_j^2 + y_k^2)/2$ and therefore E(XY) certainly exists
if $E(X^2)$ and $E(Y^2)$ exist. In this case there exist also the
expectations

(5.2)    $\mu_x = E(X), \qquad \mu_y = E(Y),$

and the variables $X - \mu_x$ and $Y - \mu_y$ have means zero. For
their product we have from the addition rule of section 2

(5.3)    $E((X - \mu_x)(Y - \mu_y)) = E(XY) - \mu_x E(Y) - \mu_y E(X) + \mu_x \mu_y = E(XY) - \mu_x \mu_y.$

Definition. The covariance of X and Y is defined by

(5.4)    $Cov(X, Y) = E((X - \mu_x)(Y - \mu_y)) = E(XY) - \mu_x \mu_y.$

This definition is meaningful whenever X and Y have finite variances.


We know from section 2 that for independent variables
E(XY) = E(X)E(Y). Hence from (5.4) we have


Theorem 1. If X and Y are independent, then Cov(X, Y) = 0.


Note that the converse is not true. For example, a glance at table 1
shows that the two variables are dependent, but their covariance
vanishes nevertheless. We shall return to this point in section 8. The
next theorem is important, and the addition rule (5.6) for independent
variables is constantly applied.


Theorem 2. If $X_1, \ldots, X_n$ are random variables with finite
variances $\sigma_1^2, \ldots, \sigma_n^2$, and
$S_n = X_1 + \cdots + X_n$, then

(5.5)    $Var(S_n) = \sum_{k=1}^{n} \sigma_k^2 + 2 \sum_{j<k} Cov(X_j, X_k),$

the last sum extending over each of the $\binom{n}{2}$ pairs
$(X_j, X_k)$ with j < k. In particular, if the $X_j$ are mutually
independent, then the addition rule

(5.6)    $Var(S_n) = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2$

holds.


Proof. Put $\mu_k = E(X_k)$ and
$m_n = \mu_1 + \cdots + \mu_n = E(S_n)$. Then
$S_n - m_n = \sum (X_k - \mu_k)$ and

(5.7)    $(S_n - m_n)^2 = \sum_k (X_k - \mu_k)^2 + 2 \sum_{j<k} (X_j - \mu_j)(X_k - \mu_k).$

Taking expectations and applying the addition rule, we get (5.5).
Equation (5.6) follows from the preceding theorem.
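
The formula for the variance of a sum can be tested on dependent variables from the placement of 3 balls into 3 cells: the occupancy numbers of the first two cells and the number of occupied cells. The sketch enumerates the 27 equally likely sample points; all names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

points = list(product((1, 2, 3), repeat=3))   # cell chosen by each ball
q = Fraction(1, len(points))

def E(g):
    """Expectation of a function g defined on the sample points."""
    return sum(g(pt) * q for pt in points)

def var(g):
    return E(lambda pt: g(pt) ** 2) - E(g) ** 2

def cov(g, h):
    return E(lambda pt: g(pt) * h(pt)) - E(g) * E(h)

def X1(pt):
    return pt.count(1)          # balls in cell 1

def X2(pt):
    return pt.count(2)          # balls in cell 2

def N(pt):
    return len(set(pt))         # number of occupied cells

variables = [X1, X2, N]

def S(pt):
    return sum(v(pt) for v in variables)

lhs = var(S)
rhs = (sum(var(v) for v in variables)
       + 2 * sum(cov(g, h) for g, h in combinations(variables, 2)))
print(lhs, rhs, cov(X1, X2), cov(N, X1))
```

The covariance of N and X1 vanishes although the two variables are dependent, while X1 and X2 are negatively correlated.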


Examples. (a) Binomial distribution $\{b(k; n, p)\}$. In example (3.a),
the variables $X_k$ are mutually independent. We have
$E(X_k^2) = 0^2 \cdot q + 1^2 \cdot p = p$, and $E(X_k) = p$. Hence
$\sigma_k^2 = p - p^2 = pq$, and from (5.6) we see that the variance of
the binomial distribution is npq. The same result was derived by direct
computation in example (4.d).

(Ὁ) Bernoulli trials with variable probabilities. Let Χ,, ..., Xn be 
mutually independent random variables such that Χμ assumes the 
values 1 and zero with probabilities p, and gq, = 1 — p, respectively. 
Then E(X;) = px and Var(Xz) = pe — pe” = pegr. Putting again 
S, = X, +...+ X, we have from (5.6) 


(5.8) Var(Sn) = Do pede- 
k= 

As in example (1.6) the variable $S_n$ may be interpreted as the total
number of successes in n independent trials, each of which results in
success or failure. Then $\bar{p} = (p_1 + \cdots + p_n)/n$ is the average
probability of success, and it seems natural to compare the present situation
to Bernoulli trials with the constant probability of success $\bar{p}$. Such a
comparison leads to a striking result. We may rewrite (5.8) in the
form $\mathrm{Var}(S_n) = n\bar{p} - \sum p_k^2$. Next, it is easily seen (by elementary
calculus or simple induction) that among all combinations $\{p_k\}$ such that
$\sum p_k = n\bar{p}$ the sum $\sum p_k^2$ assumes its minimum value when all $p_k$ are
equal. It follows that, if the average probability of success $\bar{p}$ is kept
constant, $\mathrm{Var}(S_n)$ assumes its maximum value when $p_1 = \cdots = p_n = \bar{p}$.
We have thus the surprising result that the variability of $p_k$, or lack of
uniformity, decreases the magnitude of chance fluctuations as measured
by the variance. For example, the number of annual fires in a
community may be treated as a random variable; for a given average
number, the variability is maximal if all households have the same
probability of fire. Given a certain average quality $\bar{p}$ of n machines,
the output will be least uniform if all machines are equal. (An
application to modern education is obvious but hopeless.)
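
The effect of unequal probabilities can be checked on a small numerical case. The following is only an illustrative sketch: the function name and the three probability vectors are assumptions chosen for the illustration, not part of the text. It evaluates the sum of $p_k(1 - p_k)$ for three sets of probabilities with the same average 1/2.

```python
from fractions import Fraction

def variance_of_successes(probs):
    """Var(S_n) = sum of p_k * (1 - p_k) for independent trials, as in (5.8)."""
    return sum(p * (1 - p) for p in probs)

# Three hypothetical sets of success probabilities, all with average 1/2.
equal = [Fraction(1, 2)] * 4
mild = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 2), Fraction(3, 4)]
extreme = [Fraction(0), Fraction(0), Fraction(1), Fraction(1)]

for name, probs in [("equal", equal), ("mild", mild), ("extreme", extreme)]:
    average = sum(probs) / len(probs)
    print(name, average, variance_of_successes(probs))
```

The variance is largest for the equal probabilities and drops to zero in the extreme case, where every trial is decided in advance.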

(c) Card matching. A deck of n numbered cards is put into random
order so that all n! arrangements have equal probabilities. The number
of matches (cards in their natural place) is a random variable $S_n$
which assumes the values 0, 1, ..., n. Its probability distribution
was derived in chapter IV, section 4. From it the mean and variance
could be obtained, but the following way is simpler and more
instructive.

Define a random variable $X_k$ which is either 1 or 0, according as
card number k is or is not at the kth place. Then $S_n = X_1 + \cdots + X_n$.
Now each card has probability 1/n to appear at the kth place. Hence
$P\{X_k = 1\} = 1/n$ and $P\{X_k = 0\} = (n - 1)/n$. Therefore $E(X_k) = 1/n$,
and it follows that $E(S_n) = 1$: the average is one match per
deck. To find $\mathrm{Var}(S_n)$ we first calculate the variance $\sigma_k^2$ of $X_k$:


(5.9) $\sigma_k^2 = \frac{1}{n} - \left(\frac{1}{n}\right)^2 = \frac{n-1}{n^2}$.


Next we calculate $E(X_j X_k)$. The product $X_j X_k$ is 0 or 1; the latter is
true if both card number j and card number k are at their proper
places, and the probability for that is 1/n(n - 1). Hence


(5.10) $\mathrm{Cov}(X_j, X_k) = \frac{1}{n(n-1)} - \frac{1}{n^2} = \frac{1}{n^2(n-1)}$.


4 For stronger results in the same direction see W. Hoeffding, On the distribution 
of the number of successes in independent trials, Annals of Mathematical Statistics, 
vol. 27 (1956), pp. 713-721. 
Thus finally


(5.11) $\mathrm{Var}(S_n) = n \cdot \frac{n-1}{n^2} + 2 \binom{n}{2} \frac{1}{n^2(n-1)} = 1$.


We see that both mean and variance of the number of matches are
equal to one. This result may be applied to the problem of card
guessing discussed in chapter IV, section 4. There we considered three
methods of guessing, one of which corresponds to card matching. The 
second can be described as a sequence of n Bernoulli trials with prob- 
ability p = 1/n, in which case the expected number of correct guesses 
is np = 1 and the variance npg = (n — 1)/n. The expected numbers 
are the same in both cases, but the larger variance with the first method 
indicates greater chance fluctuations about the mean and thus promises 
a slightly more exciting game. (With more complicated decks of cards 
the difference between the two variances is somewhat larger but never 
really big.) With the last mode of guessing the subject keeps calling 
the same card; the number of correct guesses is necessarily one, and 
chance fluctuations are completely eliminated (variance 0). We see 
that the strategy of calling cannot influence the expected number of 
correct guesses but has some influence on the magnitude of chance 
fluctuations. 
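
For small decks the claim that mean and variance of the number of matches both equal one can be confirmed by brute force. The sketch below is an illustration under the stated model (all n! orders equally likely); the function name is an assumption, and the enumeration is feasible only for small n.

```python
from fractions import Fraction
from itertools import permutations

def match_moments(n):
    """Exact mean and variance of the number of matches over all n! orders."""
    counts = [sum(1 for place, card in enumerate(perm) if place == card)
              for perm in permutations(range(n))]
    total = len(counts)
    mean = Fraction(sum(counts), total)
    var = Fraction(sum(c * c for c in counts), total) - mean ** 2
    return mean, var

for n in range(2, 8):
    print(n, match_moments(n))
```

The decks start at n = 2 because a single card always matches, so its variance is zero.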

(d) Sampling without replacement. Suppose that a population
consists of b black and g green elements, and that a random sample of size
r is taken (without possible repetitions). The number $S_r$ of black
elements in the sample is a random variable with the hypergeometric
distribution (chapter II, section 6) from which the mean and the
variance can be obtained by direct computation. However, the following
method is preferable. Define the random variable $X_k$ to assume the
values 1 or 0 according as the kth element in the sample is or is not
black ($k \le r$). For reasons of symmetry the probability that $X_k = 1$
is b/(b + g), and hence


(5.12) $E(X_k) = \frac{b}{b+g}, \qquad \mathrm{Var}(X_k) = \frac{bg}{(b+g)^2}$.


Next, if $j \ne k$, then $X_j X_k = 1$ if the jth and kth elements of the sample
are black, and otherwise $X_j X_k = 0$. The probability of $X_j X_k = 1$ is
b(b - 1)/(b + g)(b + g - 1), and therefore


(5.13) $E(X_j X_k) = \frac{b(b-1)}{(b+g)(b+g-1)}, \qquad \mathrm{Cov}(X_j, X_k) = \frac{-bg}{(b+g)^2(b+g-1)}$.


Thus,


(5.14) $E(S_r) = \frac{rb}{b+g}, \qquad \mathrm{Var}(S_r) = \frac{rbg}{(b+g)^2} \left\{ 1 - \frac{r-1}{b+g-1} \right\}$.

In sampling with replacement we would have the same mean, but the
variance would be slightly larger, namely, $rbg/(b + g)^2$.
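
As a check on the method, the mean rb/(b + g) and the variance $\frac{rbg}{(b+g)^2}\{1 - \frac{r-1}{b+g-1}\}$ can be compared with a direct computation from the hypergeometric probabilities. This is a sketch only; the function names and the particular values of b, g, r are assumptions made for the illustration.

```python
from fractions import Fraction
from math import comb

def sample_moments(b, g, r):
    """Exact mean and variance of the number of black elements in a sample
    of size r drawn without replacement from b black and g green elements."""
    total = comb(b + g, r)
    pmf = {k: Fraction(comb(b, k) * comb(g, r - k), total)
           for k in range(max(0, r - g), min(b, r) + 1)}
    mean = sum(k * p for k, p in pmf.items())
    var = sum(k * k * p for k, p in pmf.items()) - mean ** 2
    return mean, var

def closed_form(b, g, r):
    """Mean and variance obtained from the covariance argument."""
    n = b + g
    mean = Fraction(r * b, n)
    var = Fraction(r * b * g, n * n) * (1 - Fraction(r - 1, n - 1))
    return mean, var

print(sample_moments(5, 7, 4), closed_form(5, 7, 4))
```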


6. CHEBYSHEV’S INEQUALITY 5 


It has been pointed out that a small variance indicates that large 
deviations from the mean are improbable. This statement is made 
more precise by Chebyshev’s inequality, which is an exceedingly useful 
and handy tool. 


Theorem. Let X be a random variable with mean $\mu = E(X)$ and
variance $\sigma^2 = \mathrm{Var}(X)$. Then for any t > 0


(6.1) $P\{|X - \mu| \ge t\} \le \frac{\sigma^2}{t^2}$.

Proof. The variance is defined in (4.3) by a series with positive
terms. Delete all terms for which $|x_j - \mu| < t$; this cannot increase
the value of the series, and hence


(6.2) $\sigma^2 \ge \sum{}^{*} (x_j - \mu)^2 f(x_j)$


where the star indicates that the summation extends only over those j
for which $|x_j - \mu| \ge t$. It is then clear that


(6.3) $\sum{}^{*} (x_j - \mu)^2 f(x_j) \ge t^2 \sum{}^{*} f(x_j) = t^2 P\{|X - \mu| \ge t\}$


which proves the theorem.

Chebyshev’s inequality must be regarded as a theoretical tool rather 
than a practical method of estimation. Its importance is due to its 
universality, but no statement of great generality can be expected to 
yield sharp results in individual cases. 


Examples. (a) If X is the number scored in a throw of a true die,
then [cf. example (4.b)], $\mu = 7/2$, $\sigma^2 = 35/12$. The maximum deviation of
X from $\mu$ is $2.5 \approx 3\sigma/2$. The probability of greater deviations is zero,
whereas Chebyshev’s inequality only asserts that this probability is
smaller than 0.47.

(b) For the binomial distribution {b(k; n, p)} we have [cf. example
(5.a)] $\mu = np$, $\sigma^2 = npq$. For large n we know that


(6.4) $P\{|S_n - np| > x(npq)^{1/2}\} \approx 1 - \Phi(x) + \Phi(-x)$.


5 P. L. Chebyshev (1821-1894).
Chebyshev’s inequality states only that the left side is less than $x^{-2}$;
this is obviously a much poorer estimate than (6.4).
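
The die example can be reproduced exactly. The sketch below is an illustration, with helper names that are assumptions of this example; it compares the true probability of a deviation of at least t with the Chebyshev bound $\sigma^2/t^2$ for a few values of t.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)
mu = sum(x * p for x in faces)                 # mean of one throw
var = sum((x - mu) ** 2 * p for x in faces)    # variance of one throw

def exact_tail(t):
    """P{|X - mu| >= t} for one throw of a true die."""
    return sum(p for x in faces if abs(x - mu) >= t)

def chebyshev_bound(t):
    """Right-hand side of Chebyshev's inequality, sigma^2 / t^2."""
    return var / t ** 2

for t in [Fraction(3, 2), Fraction(5, 2), Fraction(3)]:
    print(t, exact_tail(t), float(chebyshev_bound(t)))
```

For t = 5/2 the bound equals 7/15, which is the value 0.47 quoted in example (a); for any larger t the exact probability is zero while the bound remains positive.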


*7. KOLMOGOROV’S INEQUALITY 6
As an example of more refined methods we prove: 


Let $X_1, \ldots, X_n$ be mutually independent variables with expectations
$\mu_k = E(X_k)$ and variances $\sigma_k^2$. Put


(7.1) $S_k = X_1 + \cdots + X_k$
and
(7.2) $m_k = E(S_k) = \mu_1 + \cdots + \mu_k$,
      $s_k^2 = \mathrm{Var}(S_k) = \sigma_1^2 + \cdots + \sigma_k^2$.


For every t > 0 the probability of the simultaneous realization of the n
inequalities


(7.3) $|S_k - m_k| < t s_n, \qquad k = 1, 2, \ldots, n$


is at least $1 - t^{-2}$.


For n = 1 this theorem reduces to Chebyshev’s inequality. For 
n > 1 Chebyshev’s inequality gives the same bound for the probability 
of the single relation $|S_n - m_n| < t s_n$, so that Kolmogorov’s inequality
is considerably stronger. 


Proof. We want to estimate the probability x that at least one of
the inequalities (7.3) does not hold. The theorem asserts that $x \le t^{-2}$.
Define n random variables $Y_\nu$ as follows: $Y_\nu = 1$ if


(7.4) $|S_\nu - m_\nu| \ge t s_n$
but
(7.5) $|S_k - m_k| < t s_n$ for $k = 1, 2, \ldots, \nu - 1$;


$Y_\nu = 0$ for all other sample points. In words, $Y_\nu$ equals 1 at those
points in which the $\nu$th of the inequalities (7.3) is the first to be violated.
Then at any particular sample point at most one among the $Y_k$ is 1,
and the sum $Y_1 + Y_2 + \cdots + Y_n$ can assume only the values 0 or 1;


* This section treats a special topic and should be omitted at first reading. 
6 Über die Summen zufälliger Grössen, Mathematische Annalen, vol. 99 (1928),
pp. 309-319, and vol. 102 (1929), pp. 484-488. 
it is 1 if, and only if, at least one of the inequalities (7.3) is violated,
and therefore
(7.6) $x = P\{Y_1 + \cdots + Y_n = 1\}$.


Since $Y_1 + \cdots + Y_n$ is 0 or 1, we have $\sum Y_k \le 1$. Multiplying by
$(S_n - m_n)^2$ and taking expectations, we get


(7.7) $\sum_{k=1}^{n} E(Y_k (S_n - m_n)^2) \le s_n^2$.

For an evaluation of the terms on the left we put
(7.8) $U_k = (S_n - m_n) - (S_k - m_k) = \sum_{\nu=k+1}^{n} (X_\nu - \mu_\nu)$.
Then
(7.9) $E(Y_k (S_n - m_n)^2) = E(Y_k (S_k - m_k)^2) + 2E(Y_k U_k (S_k - m_k)) + E(Y_k U_k^2)$.


However, $U_k$ depends only on $X_{k+1}, \ldots, X_n$ while $Y_k$ and $S_k$ depend
only on $X_1, \ldots, X_k$. Hence $U_k$ is independent of $Y_k (S_k - m_k)$ and
therefore $E(Y_k U_k (S_k - m_k)) = E(Y_k (S_k - m_k)) E(U_k) = 0$, since
$E(U_k) = 0$. Thus from (7.9)


(7.10) $E(Y_k (S_n - m_n)^2) \ge E(Y_k (S_k - m_k)^2)$.


But $Y_k \ne 0$ only if $|S_k - m_k| \ge t s_n$, so that $Y_k (S_k - m_k)^2 \ge t^2 s_n^2 Y_k$.
Hence, combining (7.7) and (7.10), we get


(7.11) $s_n^2 \ge t^2 s_n^2 E(Y_1 + \cdots + Y_n)$.


Since $Y_1 + \cdots + Y_n$ equals either 0 or 1, the expectation to the right
equals the probability x defined in (7.6). Thus $x t^2 \le 1$ as asserted.


*8. THE CORRELATION COEFFICIENT 


Let X and Y be any two random variables with means $\mu_x$ and $\mu_y$
and positive variances $\sigma_x^2$ and $\sigma_y^2$. We introduce the corresponding
normalized variables X* and Y* defined by (4.6). Their covariance is
called the correlation coefficient of X, Y and is denoted by $\rho(X, Y)$. Thus,
using (5.4),


(8.1) $\rho(X, Y) = \mathrm{Cov}(X^*, Y^*) = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$.


* This section treats a special topic and may be omitted at first reading. 
Clearly this correlation coefficient is independent of the origins and
units of measurements, that is, for any constants $a_1, a_2, b_1, b_2$, with
$a_1 > 0$, $a_2 > 0$, we have $\rho(a_1 X + b_1, a_2 Y + b_2) = \rho(X, Y)$.

The use of the correlation coefficient amounts to a fancy way of
writing the covariance.⁷ Unfortunately, the term correlation is
suggestive of implications which are not inherent in it. We know from section
5 that $\rho(X, Y) = 0$ whenever X and Y are independent. It is important
to realize that the converse is not true. In fact, the correlation coefficient
$\rho(X, Y)$ can vanish even if Y is a function of X.


Examples. (a) Let X assume the values ±1, ±2 each with
probability 1/4. Let $Y = X^2$. The joint distribution is given by
p(-1, 1) = p(1, 1) = p(2, 4) = p(-2, 4) = 1/4. For reasons of symmetry
$\rho(X, Y) = 0$ even though we have a direct functional dependence of
Y on X.

(b) Let U and V be independent variables with the same distribution,
and let X = U + V, Y = U - V. Then $E(XY) = E(U^2) - E(V^2) = 0$
and E(Y) = 0. Hence Cov(X, Y) = 0 and therefore also $\rho(X, Y) = 0$.
For example, X and Y may be the sum and difference of points on two
dice. Then X and Y are either both odd or both even and therefore
dependent.
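
Both examples are small enough to verify exactly. The sketch below is an illustration; the helper name and the dictionary representation of a joint distribution are assumptions of this example. It computes the covariance in each case and, for the dice, exhibits a pair of values whose joint probability differs from the product of the marginal probabilities.

```python
from fractions import Fraction
from itertools import product

def covariance(joint):
    """Covariance computed from a dict {(x, y): probability}."""
    ex = sum(x * p for (x, y), p in joint.items())
    ey = sum(y * p for (x, y), p in joint.items())
    exy = sum(x * y * p for (x, y), p in joint.items())
    return exy - ex * ey

# Example (a): X takes the values -2, -1, 1, 2 with probability 1/4, Y = X^2.
joint_a = {(x, x * x): Fraction(1, 4) for x in (-2, -1, 1, 2)}

# Example (b): sum and difference of the points on two true dice.
joint_b = {}
for u, v in product(range(1, 7), repeat=2):
    key = (u + v, u - v)
    joint_b[key] = joint_b.get(key, 0) + Fraction(1, 36)

print(covariance(joint_a), covariance(joint_b))

# Dependence in (b): X = 2 and Y = 1 each have positive probability,
# but they cannot occur together because X and Y have the same parity.
p_x2 = sum(p for (x, y), p in joint_b.items() if x == 2)
p_y1 = sum(p for (x, y), p in joint_b.items() if y == 1)
print(joint_b.get((2, 1), 0), p_x2 * p_y1)
```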


It follows that the correlation coefficient is by no means a general
measure of dependence between X and Y. However, $\rho(X, Y)$ is
connected with the linear dependence of X and Y.


Theorem. We have always $|\rho(X, Y)| \le 1$; furthermore, $\rho(X, Y) = \pm 1$
only if there exist constants a and b such that Y = aX + b, except,
perhaps, for values of X with zero probability.


Proof. Let X* and Y* be the normalized variables. Then
(8.2) $\mathrm{Var}(X^* \pm Y^*) = \mathrm{Var}(X^*) \pm 2\,\mathrm{Cov}(X^*, Y^*) + \mathrm{Var}(Y^*) = 2(1 \pm \rho(X, Y))$.


The left side cannot be negative; hence $|\rho(X, Y)| \le 1$. For $\rho(X, Y) = 1$
it is necessary that $\mathrm{Var}(X^* - Y^*) = 0$ which means that with unit
probability the variable X* - Y* assumes only one value. In this case
X* - Y* = const., and hence Y = aX + const. with $a = \sigma_y/\sigma_x$. A
similar argument applies to the case $\rho(X, Y) = -1$.


7 The physicist would define the correlation coefficient as “dimensionless
covariance.”
9. PROBLEMS FOR SOLUTION


1. Seven balls are distributed randomly in seven cells. Let $X_i$ be the
number of cells containing exactly i balls. Using the probabilities tabulated in
chapter II, section 5, write down the joint distribution of $(X_2, X_3)$.

2. Two ideal dice are thrown. Let X be the score on the first die and Y 
be the larger of two scores. (a) Write down the joint distribution of X and Y. 
(b) Find the means, the variances, and the covariance.

3. In five tosses of a coin let X, Y, Z be, respectively, the number of heads, 
the number of head runs, the length of the largest head run. Tabulate the 32 
sample points together with the corresponding values of X, Y, and Z. By 
simple counting derive the joint distributions of the pairs (X, Y), (X, Z), (Y, Z) 
and the distributions of X + Y and XY. Find the means, variances, covari- 
ances of the variables. 

4. The random variables $X_1$ and $X_2$ are independent and have the same
geometric distribution $\{q^k p\}$, where k = 0, 1, .... Let Z be defined as the larger
of $X_1$ and $X_2$ [in symbols, $Z = \max(X_1, X_2)$]. Derive the joint distribution
of Z and $X_1$ and the distribution of Z.

5. Let $X_1$ and $X_2$ be independent random variables with Poisson
distributions $\{p(k; \lambda_1)\}$ and $\{p(k; \lambda_2)\}$. Prove that $X_1 + X_2$ has the Poisson
distribution $\{p(k; \lambda_1 + \lambda_2)\}$.

6. Continuation. Show that the conditional distribution of $X_1$ given $X_1 + X_2$
is binomial, namely

(9.1) $P\{X_1 = k \mid X_1 + X_2 = n\} = b\left(k;\, n,\, \frac{\lambda_1}{\lambda_1 + \lambda_2}\right)$.

7. Let $X_1$ and $X_2$ be independent and have the common geometric
distribution $\{q^k p\}$ (as in problem 4). Show without calculations that the conditional
distribution of $X_1$ given $X_1 + X_2$ is uniform, that is,


(9.2) $P\{X_1 = k \mid X_1 + X_2 = n\} = \frac{1}{n+1}, \qquad k = 0, \ldots, n$.

8. Let $X_1, \ldots, X_n$ be mutually independent random variables, each having
the uniform distribution $P\{X_i = k\} = 1/N$ for k = 1, 2, ..., N. Let $U_n$ be
the smallest among the $X_1, \ldots, X_n$ and $V_n$ the largest. Find the distributions of
$U_n$ and $V_n$. What is the connection with the estimation problem (3.e)?


9. In the estimation problem (3.e) find the joint distribution of the largest
and the smallest observation. Specialize to n = 2. (Hint: Calculate first
$P\{X \le r, Y \ge s\}$.)

10. Continuation. Find the conditional probability that the first two
observations are j and k, given that X = r.


11. Continuation. Find $E(X^2)$ and hence an asymptotic expression for
Var(X) as $N \to \infty$ (with n fixed).

12. Sampling inspection. Suppose that items with a probability p of being
defective are subjected to inspection in such a way that the probability of an
item being inspected is p'. We have four classes, namely, “acceptable and
inspected,” “acceptable but not inspected,” etc. with corresponding
probabilities pp', pq', p'q, qq' where q = 1 - p, q' = 1 - p'. We are concerned with
double Bernoulli trials [see example VI(9.c)]. Let N be the number of items
passing the inspection desk (both inspected and uninspected) before the first
defective is found, and let K be the (undiscovered) number of defectives among
them. Find the joint distribution of N and K and the marginal distributions.


13. Continuation. Find $E\left(\frac{K}{N+1}\right)$ and Cov(K, N). [In industrial
practice the discovered defective item is replaced by an acceptable one so that
K/(N + 1) is the fraction of defectives and measures the quality of the lot.
Note that $E\left(\frac{K}{N+1}\right)$ is not E(K)/E(N + 1).]

14. In a sequence of Bernoulli trials let X be the length of the run (of either 
successes or failures) started by the first trial. Find the distribution of X, 
E(X), Var(X).

15. Continuation. Let Y be the length of the second run. Find the distribu- 
tion of Y, E(Y), Var(Y), and the joint distribution of X, Y. 


16. If two random variables X and Y assume only two values each, and if 
Cov(X, Y) = 0, then X and Y are independent. 


17. Birthdays. For a group of n people find the expected number of days 
of the year which are birthdays of exactly k people. (Assume 365 days and 
that all arrangements are equally probable.) 

18. Continuation. Find the expected number of multiple birthdays. How 
large should n be to make this expectation exceed 1? 

19. A man with n keys wants to open his door and tries the keys
independently and at random. Find the mean and variance of the number of trials
(a) if unsuccessful keys are not eliminated from further selections; (b) if they
are. (Assume that only one key fits the door. The exact distributions are
given in chapter II, section 7, but are not required for the present problem.)

20. Let (X, Y) be random variables whose joint distribution is the trinomial
defined by (1.9). Find E(X), Var(X), and Cov(X, Y) (a) by direct
computation, (b) by representing X and Y as sums of n variables each and using the
methods of section 5.

21. Find the covariance of the number of ones and sixes in n throws of a 
die. 

22. In the animal trapping problem VI, 24 prove that the expected number
of animals trapped at the $\nu$th trapping is $nqp^{\nu-1}$.

23. If X has the geometric distribution $P\{X = k\} = q^k p$ (where k = 0, 1,
...), show that $\mathrm{Var}(X) = qp^{-2}$. Conclude that the negative binomial
distribution {f(k; r, p)} has variance $rqp^{-2}$ provided r is a positive integer. Prove by
direct calculation that the statement remains true for all r > 0.


24. In the waiting time problem (3.d) prove that

$\mathrm{Var}(S_r) = N \left\{ \frac{1}{(N-1)^2} + \frac{2}{(N-2)^2} + \cdots + \frac{r-1}{(N-r+1)^2} \right\}$.

Hint. Use the variance of the geometric distribution obtained in problem
23. Incidentally, as $N \to \infty$ we find $N^{-2}\,\mathrm{Var}(S_N) \to \pi^2/6$.


25. Continuation. Let $Y_r$ be the number of drawings required to include
r preassigned elements (instead of any r different elements as in the text).
Find $E(Y_r)$ and $\mathrm{Var}(Y_r)$. (Note: The exact distribution of $Y_r$ was found in
problem II(11.12) but is not required for the present purpose.)


26. A large number, N, of people are subject to a blood test. This can be
administered in two ways. (i) Each person can be tested separately. In this
case N tests are required. (ii) The blood samples of k people can be pooled
and analyzed together. If the test is negative, this one test suffices for the k
people. If the test is positive, each of the k persons must be tested separately,
and in all k + 1 tests are required for the k people.

Assume the probability p that the test is positive is the same for all people 
and that people are stochastically independent. 

(a) What is the probability that the test for a pooled sample of k people will 
be positive? 

(b) What is the expected value of the number, X, of tests necessary under 
plan (ii)? 

(c) Which k will minimize the expected number of tests under plan (ii)? 
Do not try numerical evaluations, since the problem leads to a rather cumber- 
some equation for k. 


27. Sample structure. A population consists of r classes whose sizes are in 


ment. Find the expected number of classes not represented in the sample. 


28. Let X be the number of $\alpha$ runs in a random arrangement of $r_1$ alphas
and $r_2$ betas. The distribution of X is given in problem II(11.23). Find E(X)
and Var(X).


29. In Polya’s urn scheme [V(2.c)] let $X_n$ be one or zero according as the nth
trial results in black or red. Prove $\rho(X_n, X_m) = c/(b + r + c)$ for $n \ne m$.


30. Continuation. Let $S_n$ be the total number of black balls extracted in
the first n drawings (that is, $S_n = X_1 + \cdots + X_n$). Find $E(S_n)$ and $\mathrm{Var}(S_n)$.
(Use problems V, 19 and V, 20; verify the result by means of the recursion 
formula of problem V, 22.) 


31. Stratified sampling. A city has n blocks of which $n_j$ have $x_j$
inhabitants each ($n_1 + n_2 + \cdots = n$). Let $m = \sum n_j x_j / n$ be the mean number of
inhabitants per block and put $a^2 = \sum n_j x_j^2 / n - m^2$. In sampling without
replacement r blocks are selected at random, and in each the inhabitants are
counted. Let $X_1, \ldots, X_r$ be the respective number of inhabitants. Show that


$E(X_1 + \cdots + X_r) = mr, \qquad \mathrm{Var}(X_1 + \cdots + X_r) = \frac{a^2 r (n - r)}{n - 1}$.


(In sampling with replacement the variance would be larger, namely, $a^2 r$.)


32. Length of random chains.⁹ A chain in the x, y-plane consists of n links,
each of unit length. The angle between two consecutive links is $\pm\alpha$ where $\alpha$
is a positive constant; each possibility has probability 1/2, and the successive


8 This problem is based on a new technique developed during World War II. 
See R. Dorfman, The detection of defective members of large populations, Annals 
of Mathematical Statistics, vol. 14 (1943), pp. 436-440. In army practice, plan (ii)
introduced up to 80 per cent savings. 

9 This is the two-dimensional analogue to the problem of length of long polymer
molecules in chemistry. The problem illustrates applications to random variables 
which are not expressible as sums of simple variables. 




angles are mutually independent. The distance $L_n$ from the beginning to the
end of the chain is a random variable, and we wish to prove that


(9.3) $E(L_n^2) = n \frac{1 + \cos\alpha}{1 - \cos\alpha} - 2\cos\alpha \frac{1 - \cos^n \alpha}{(1 - \cos\alpha)^2}$.


Without loss of generality the first link may be assumed to lie in the
direction of the positive x-axis. The angle between the kth link and the positive
x-axis is a random variable $S_{k-1}$ where $S_0 = 0$, $S_k = S_{k-1} + X_k \alpha$ and the $X_k$
are mutually independent variables, assuming the values ±1 with probability
1/2. The projections on the two axes of the kth link are $\cos S_{k-1}$ and $\sin S_{k-1}$.
Hence for $n \ge 1$


(9.4) $L_n^2 = \left(\sum_{k=0}^{n-1} \cos S_k\right)^2 + \left(\sum_{k=0}^{n-1} \sin S_k\right)^2$.

Prove by induction successively for m < n
(9.5) $E(\cos S_n) = \cos^n \alpha, \qquad E(\sin S_n) = 0$;
(9.6) $E((\cos S_m) \cdot (\cos S_n)) = \cos^{n-m} \alpha \cdot E(\cos^2 S_m)$
(9.7) $E((\sin S_m) \cdot (\sin S_n)) = \cos^{n-m} \alpha \cdot E(\sin^2 S_m)$
(9.8) $E(L_n^2) - E(L_{n-1}^2) = 1 + 2\cos\alpha \cdot \frac{1 - \cos^{n-1} \alpha}{1 - \cos\alpha}$


(with $L_0 = 0$) and hence finally (9.3).


33. A sequence of Bernoulli trials is continued as long as necessary to obtain 
r successes, where r is a fixed integer. Let X be the number of trials required. 
Find 10 E(r/X). (The definition leads to infinite series for which a finite ex- 
pression can be obtained.) 

34. In a random placement of r balls into n cells the probability of finding
exactly m cells empty satisfies the recursion formula II(11.8). Let $m_r$ be the
expected number of occupied cells. From the recursion formula prove that


$m_{r+1} = 1 + (1 - n^{-1}) m_r$, and conclude $m_r = n\left\{1 - \left(1 - \frac{1}{n}\right)^r\right\}$.


35. Let $S_n$ be the number of successes in n Bernoulli trials. Prove
$E(|S_n - np|) = 2\nu q\, b(\nu; n, p)$


where $\nu$ is the integer such that $np < \nu \le np + 1$.
36. Let $\{X_k\}$ be a sequence of mutually independent random variables with
a common distribution. Suppose that the $X_k$ assume only positive values and

10 This example illustrates the effect of optional stopping. If the number n of 
trials is fixed, the ratio of the number N of successes to the number n of trials is 
a random variable whose expectation is p. It is often erroneously assumed that 
the same is true in our example where the number r of successes is fixed and the 
number of trials depends on chance. If p = 1/2 and r = 2, then E(2/X) = 0.614
instead of 0.5; for r = 3 we find E(3/X) = 0.579.
that $E(X_k) = a$ and $E(X_k^{-1}) = b$ exist. Let $S_n = X_1 + \cdots + X_n$. Prove that
$E(S_n^{-1})$ is finite and that $E(X_k/S_n) = 1/n$ for k = 1, 2, ..., n.
37. Continuation.¹¹ Prove that


$E\left(\frac{S_m}{S_n}\right) = \frac{m}{n}$, if $m \le n$;


$E\left(\frac{S_m}{S_n}\right) = 1 + (m - n)\, a\, E(S_n^{-1})$, if $m \ge n$.


38. Let $X_1, \ldots, X_n$ be mutually independent random variables with a
common distribution; let its mean be m, its variance $\sigma^2$. Let $\bar{X} = (X_1 + \cdots + X_n)/n$.
Prove that ¹²

$\sigma^2 = \frac{1}{n-1} E\left(\sum_{k=1}^{n} (X_k - \bar{X})^2\right)$.

39. Let $X_1, \ldots, X_n$ be mutually independent random variables. Let U be
a function of $X_1, \ldots, X_k$ and V a function of $X_{k+1}, \ldots, X_n$ (k < n). Prove
that U and V are mutually independent random variables.


40. Generalized Chebyshev inequality. Let $\phi(x) > 0$ for x > 0 be
monotonically increasing and suppose that $E(\phi(|X|)) = M$ exists. Prove that

$P\{|X| \ge t\} \le \frac{M}{\phi(t)}$.

41. Schwarz inequality. For any two random variables with finite variances
one has $E^2(XY) \le E(X^2) E(Y^2)$. Prove this from the fact that the quadratic
polynomial $E((tX + Y)^2)$ is non-negative.


11 The observation that 37 can be proved by introducing 36 is due to K. L. Chung.
12 This can be expressed by saying that $\sum (X_k - \bar{X})^2/(n - 1)$ is an unbiased
estimator of $\sigma^2$.
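
The numerical values quoted in footnote 10 to problem 33 can be checked by summing the series directly. The sketch below is an illustration only; it assumes the usual distribution of the number X of trials needed for r successes, namely $P\{X = x\} = \binom{x-1}{r-1} p^r q^{x-r}$ for x = r, r + 1, ..., and truncates the series after a fixed number of terms (the function name and the truncation length are choices made for this example). For r = 2 and p = 1/2 the series sums to 2(1 - log 2), about 0.614.

```python
from math import comb, log

def expected_ratio(r, p, terms=200):
    """Truncated series for E(r/X), X = number of trials needed for r successes.

    Assumes P{X = x} = C(x-1, r-1) p^r q^(x-r) for x = r, r+1, ...
    """
    q = 1 - p
    return sum((r / x) * comb(x - 1, r - 1) * p ** r * q ** (x - r)
               for x in range(r, r + terms))

print(round(expected_ratio(2, 0.5), 3), round(2 * (1 - log(2)), 3))
print(round(expected_ratio(3, 0.5), 3))
```

In both cases the expectation exceeds p = 1/2, which is the effect of optional stopping described in the footnote.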


CHAPTER X


Laws of Large Numbers 


1. IDENTICALLY DISTRIBUTED VARIABLES 


The limit theorems for Bernoulli trials derived in chapters VII and 
VIII are special cases of general limit theorems which cannot be treated 
in this volume. However, we shall here discuss at least some cases of 
the law of large numbers in order to reveal a new aspect of the expecta- 
tion of a random variable. 

The connection between Bernoulli trials and the theory of random
variables becomes clearer when we consider the dependence of the
number $S_n$ of successes on the number n of trials. With each trial $S_n$
increases by 1 or 0, and we can write


(1.1) $S_n = X_1 + \cdots + X_n$


where the random variable $X_k$ equals 1 if the kth trial results in success
and zero otherwise. Thus $S_n$ is a sum of n mutually independent
random variables, each of which assumes the values 1 and 0 with
probabilities p and q. From this it is only one step to consider sums of the
form (1.1) where the $X_k$ are mutually independent variables with an
arbitrary distribution. The (weak) law of large numbers of chapter
VI, section 4, states that for large n the average proportion of
successes $S_n/n$ is likely to lie near p. This is a special case of the following


Law of Large Numbers. Let $\{X_k\}$ be a sequence of mutually
independent random variables with a common distribution. If the
expectation $\mu = E(X_k)$ exists, then for every $\epsilon > 0$ as $n \to \infty$


(1.2) $P\left\{ \left| \frac{X_1 + \cdots + X_n}{n} - \mu \right| > \epsilon \right\} \to 0$;


in words, the probability that the average $S_n/n$ will differ from the
expectation by less than an arbitrarily prescribed $\epsilon$ tends to one.

In this generality the theorem was first proved by Khintchine.¹
Older proofs had to introduce the unnecessary restriction that the
variance $\mathrm{Var}(X_k)$ should also be finite.² For this case, however, there exists
a much more precise result which generalizes the DeMoivre-Laplace
limit theorem for Bernoulli trials, namely the


Central Limit Theorem. Let $\{X_k\}$ be a sequence of mutually
independent random variables with a common distribution. Suppose that
$\mu = E(X_k)$ and $\sigma^2 = \mathrm{Var}(X_k)$ exist and let $S_n = X_1 + \cdots + X_n$. Then
for every fixed $\beta$


(1.3) $P\left\{ \frac{S_n - n\mu}{\sigma n^{1/2}} < \beta \right\} \to \Phi(\beta)$


where $\Phi(x)$ is the normal distribution introduced in chapter VII,
section 1. This theorem is due to Lindeberg;³ Ljapunov and other authors
had previously proved it under more restrictive conditions. It must be
understood that this theorem is only a special case of a much more
general theorem whose formulation and proof are deferred to the
second volume. Here we note that (1.3) is stronger than (1.2), since it
gives an estimate for the probability that the discrepancy $|S_n/n - \mu|$
is larger than $\sigma/n^{1/2}$. On the other hand, the law of large numbers (1.2)
holds even when the random variables $X_k$ have no finite variance so
that it is more general than the central limit theorem. For this reason
we shall give an independent proof of the law of large numbers, but
first we illustrate the two limit theorems.


Examples. (a) In a sequence of independent throws of a symmetric
die let $X_k$ be the number scored at the kth throw. Then
$E(X_k) = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5$, and
$\mathrm{Var}(X_k) = (1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2)/6 - (3.5)^2 = 35/12$.
The law of large numbers states
that for large n the average score $S_n/n$ is likely to be near 3.5. The
central limit theorem states that the probability of
$|S_n - 3.5n| < \alpha (35n/12)^{1/2}$ is about $\Phi(\alpha) - \Phi(-\alpha)$. For n = 1000 and $\alpha = 1$ we
find that there is roughly probability 0.68 that $3450 < S_n < 3550$.
Choosing for $\alpha$ the median value $\alpha = 0.6744$, we find that there are

1 A. Khintchine, Sur la loi des grands nombres, Comptes rendus de l’Académie des
Sciences, vol. 189 (1929), pp. 477-479. Incidentally, the reader should observe 
the warning given in connection with the law of large numbers for Bernoulli trials 
at the end of chapter VI, section 4. 

2 A. Markov showed that the existence of $E(|X_k|^{1+a})$ for some a > 0 suffices.

3 J. W. Lindeberg, Eine neue Herleitung des Exponentialgesetzes in der Wahr- 
scheinlichkeitsrechnung, Mathematische Zeitschrift, vol. 15 (1922), pp. 211-225. 
roughly equal chances that $S_n$ lies within or without the interval
3500 ± 36.

(b) Sampling. Suppose that in a population of N families there are
$N_k$ families with exactly k children (k = 0, 1, ...; $\sum N_k = N$). For a
family chosen at random, the number of children is a random variable
which assumes the value $\nu$ with probability $p_\nu = N_\nu/N$. A sample of
size n with replacement represents n independent random variables or
“observations” $X_1, \ldots, X_n$, each with the same distribution; $S_n/n$ is
the sample average. The law of large numbers tells us that for
sufficiently large random samples the sample average is likely to be near
$\mu = \sum \nu p_\nu = \sum \nu N_\nu / N$, namely the population average. The central
limit theorem permits us to estimate the probable magnitude of the
discrepancy and to determine the sample size necessary for reliable
estimates. In practice both $\mu$ and $\sigma^2$ are unknown. However, it is
usually easy to obtain a preliminary estimate of $\sigma^2$, and it is always
possible to keep to the safe side. If it is desired that there be
probability 0.99 or better that the sample average $S_n/n$ differ from the
unknown population mean $\mu$ by less than 1/10, then the sample size should
be such that


(1.4) $P\left\{ \left| \frac{S_n - n\mu}{n} \right| < \frac{1}{10} \right\} \ge 0.99$.


The root of $\Phi(x) - \Phi(-x) = 0.99$ is x = 2.57..., and hence n should
satisfy $n^{1/2}/10\sigma \ge 2.57$ or $n \ge 660\sigma^2$. A cautious preliminary estimate
of $\sigma^2$ gives us an idea of the required sample size. Similar situations
occur frequently. Thus when the experimenter takes the mean of n
measurements he, too, relies on the law of large numbers and uses a
sample mean as an estimate for an unknown theoretical expectation.
The reliability of this estimate can be judged only in terms of $\sigma^2$, and
usually we are compelled to use rather crude estimates for $\sigma^2$.
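
The sample-size rule of example (b) is easy to evaluate for any preliminary estimate of the standard deviation. The sketch below is an illustration under the normal approximation; the function name, its parameters, and the trial values of sigma are assumptions of this example. It solves $\Phi(x) - \Phi(-x) = 0.99$ numerically rather than using the rounded root 2.57, so it returns about $663.5\sigma^2$ instead of the rounded $660\sigma^2$.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sigma, tolerance=0.1, confidence=0.99):
    """Smallest n with Phi(x) - Phi(-x) >= confidence at x = tolerance*sqrt(n)/sigma,
    using the normal approximation of the central limit theorem."""
    x = NormalDist().inv_cdf((1 + confidence) / 2)  # root of Phi(x) - Phi(-x) = confidence
    return ceil((x * sigma / tolerance) ** 2)

for sigma in (0.5, 1.0, 2.0):   # hypothetical preliminary estimates of sigma
    print(sigma, required_sample_size(sigma))
```

Because the required n grows with the square of sigma, an overestimate of sigma keeps the sample size on the safe side.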

(c) The Poisson distribution. In chapter VII, section 4, we found
that for large $\lambda$ the Poisson distribution $\{p(k; \lambda)\}$ can be approximated
by the normal distribution. This is really a direct consequence of the
central limit theorem. Suppose that the variables $X_k$ have a Poisson
distribution $\{p(k; \gamma)\}$. Then $S_n$ has a Poisson distribution $\{p(k; n\gamma)\}$
with mean and variance equal to $n\gamma$. Writing $\lambda$ for $n\gamma$, we conclude
that as $n \to \infty$


(1.5) $\sum_{k < \lambda + \beta\lambda^{1/2}} e^{-\lambda} \frac{\lambda^k}{k!} \to \Phi(\beta)$,


the summation extending over all k up to $\lambda + \beta\lambda^{1/2}$. It is now obvious
that (1.5) holds also when $\lambda$ approaches $\infty$ in an arbitrary manner.
This theorem is used in the theory of summability of divergent series
and is of general interest; estimates of the difference of the two sides
in (1.5) are available from the general theory.
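
The convergence in (1.5) can be observed numerically. The sketch below is an illustration; the function name and the chosen values of the Poisson parameter are assumptions of this example. It sums the Poisson probabilities over all k not exceeding the parameter plus beta times its square root and prints the result next to the normal distribution function at beta.

```python
from math import exp, floor, sqrt
from statistics import NormalDist

def poisson_sum(lam, beta):
    """Sum of e^(-lam) lam^k / k! over all k <= lam + beta*sqrt(lam)."""
    upper = floor(lam + beta * sqrt(lam))
    if upper < 0:
        return 0.0
    term = exp(-lam)          # the k = 0 term
    total = term
    for k in range(1, upper + 1):
        term *= lam / k       # p(k; lam) = p(k-1; lam) * lam / k
        total += term
    return total

beta = 1.0
for lam in (25, 100, 400):
    print(lam, poisson_sum(lam, beta), NormalDist().cdf(beta))
```

The two columns approach each other slowly as the parameter grows, in line with the remark that estimates of the difference are available from the general theory.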


Note on Variables without Expectation 


Both the law of large numbers and the central limit theorem become
meaningless if the expectation $\mu$ does not exist, but they can be
replaced by more general theorems supplying the same sort of
information. In the modern theory variables without expectation play an
important role and many waiting and recurrence times in physics turn out
to be of this type. This is true even of the simple coin-tossing game.

Suppose that n coins are tossed one by one. For the kth coin let $X_k$
be the waiting time up to the first equalization of the accumulated
numbers of heads and tails. The $X_k$ are mutually independent random
variables with a common distribution: each $X_k$ assumes only even
positive values and $P\{X_k = 2r\} = f_{2r}$ with the probability distribution
$\{f_{2r}\}$ defined in III(4.2). According to theorem 3 of chapter III,
section 4, the distribution of the sum $S_n = X_1 + \cdots + X_n$ is given by


(1.6) $P\{S_n = 2r\} = f_{2r}^{(n)}$


with $f_{2r}^{(n)}$ defined in III(4.11). In chapter III, section 8(c), it was
shown that as $n \to \infty$


(1.7) $P\{S_n < n^2 x\} \to 2[1 - \Phi(x^{-1/2})]$.


We have here a limit theorem of the same character as the central limit
theorem with the remarkable difference that this time the variable $S_n/n^2$,
rather than $S_n/n$, possesses a limiting distribution.

In physical language the $X_k$ represent independent measurements on
the same quantity, and the limit theorem asserts that, in probability,
the average $S_n/n$ increases linearly with n. The surprising
consequences of this behavior were discussed in chapter III.⁴


*2. PROOF OF THE LAW OF LARGE NUMBERS 


We proceed in two steps. First assume that $\sigma^2 = \operatorname{Var}(X_k)$ exists
and note that here $\operatorname{Var}(S_n) = n\sigma^2$, by the addition rule IX(5.6).
According to the Chebyshev inequality IX(6.1), we have for every
$t > 0$

(2.1)    $P\{|S_n - n\mu| > t\} \le \dfrac{n\sigma^2}{t^2}.$


4 For an analogue to the law of large numbers in a case of variables without finite
expectation see section 4 and problem 13.
* This section treats a special topic and may be omitted at first reading.

For $t > \epsilon n$ the left side is less than $\sigma^2/(\epsilon^2 n)$, which tends to zero. This
accomplishes the proof.
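
To see how the estimate (2.1) behaves in a concrete case, one may compare it with an exactly computable probability. The sketch below assumes, purely for illustration, that $X_k$ is the indicator of heads in a fair coin toss (so $\mu = 1/2$, $\sigma^2 = 1/4$); the function names are ours.

```python
import math

def exact_tail(n, eps):
    """P{|S_n - n/2| > eps*n} for the number S_n of heads in n fair tosses."""
    favorable = sum(math.comb(n, k) for k in range(n + 1)
                    if abs(k - n / 2) > eps * n)
    return favorable / 2 ** n

def chebyshev_bound(n, eps, variance=0.25):
    """Right side of (2.1) with t = eps*n, i.e. n*variance / (eps*n)^2."""
    return variance / (eps ** 2 * n)

if __name__ == "__main__":
    for n in (100, 400, 1000):
        print(n, exact_tail(n, 0.05), chebyshev_bound(n, 0.05))
```

The Chebyshev bound is crude, but it tends to zero like $1/n$, and this is all the proof requires.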

Next we drop the restriction that $\operatorname{Var}(X_k)$ exists. This case is
reduced to the preceding one by the method of truncation, which is an
important standard tool. Define two new collections of random variables
depending on the $X_k$ as follows:

(2.2)    $U_k = X_k, \quad V_k = 0 \quad$ if $|X_k| \le \epsilon n$;
         $U_k = 0, \quad V_k = X_k \quad$ if $|X_k| > \epsilon n$.

Here $k = 1, \ldots, n$ and $\epsilon > 0$ is fixed. Then identically

(2.3)    $X_k = U_k + V_k.$


If $\{f(x_j)\}$ is the common probability distribution of the variables $X_k$,
the sum

(2.4)    $\sum |x_j|\, f(x_j) = A$

is finite since $\mu = E(X_k)$ was assumed to exist. Now

(2.5)    $\mu_n = E(U_k) = \sum x_j f(x_j),$

the summation extending over those $j$ for which $|x_j| \le \epsilon n$. Clearly
$\mu_n \to \mu$ as $n \to \infty$, and hence for all $n$ sufficiently large and for
arbitrary $\delta > 0$

(2.6)    $|\mu_n - \mu| < \delta.$

Furthermore, from (2.5) and (2.4),

(2.7)    $\operatorname{Var}(U_k) \le E(U_k^2) \le \epsilon n \sum |x_j|\, f(x_j) \le \epsilon A n.$


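The $U_k$ are mutually independent, and their sum $U_1 + U_2 + \cdots + U_n$
can be treated exactly as the $X_k$ in the case of finite variances; applying
the Chebyshev inequality, we get the following analogue to (2.1):

(2.8)    $P\left\{\left|\dfrac{U_1 + \cdots + U_n}{n} - \mu_n\right| > \delta\right\} \le \dfrac{\operatorname{Var}(U_k)}{n\delta^2} \le \dfrac{\epsilon A}{\delta^2}.$

In view of (2.6) this implies

(2.9)    $P\left\{\left|\dfrac{U_1 + \cdots + U_n}{n} - \mu\right| > 2\delta\right\} \le \dfrac{\epsilon A}{\delta^2}.$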
Note that there is a large probability that $V_k = 0$. In fact

(2.10)    $P\{V_k \ne 0\} = \sum_{|x_j| > \epsilon n} f(x_j) \le \dfrac{1}{\epsilon n} \sum_{|x_j| > \epsilon n} |x_j|\, f(x_j),$

and the last sum tends to 0 with increasing $n$. Therefore for $n$
sufficiently large

(2.11)    $P\{V_k \ne 0\} \le \dfrac{\epsilon}{n}$

and hence by the basic inequality I(7.6)

(2.12)    $P\{V_1 + \cdots + V_n \ne 0\} \le \epsilon.$


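Now $S_n = (U_1 + \cdots + U_n) + (V_1 + \cdots + V_n)$, and therefore from
(2.9) and (2.12)

(2.13)    $P\left\{\left|\dfrac{S_n}{n} - \mu\right| > 2\delta\right\} \le P\left\{\left|\dfrac{U_1 + \cdots + U_n}{n} - \mu\right| > 2\delta\right\} + P\{V_1 + \cdots + V_n \ne 0\} \le \dfrac{\epsilon A}{\delta^2} + \epsilon.$

Since $\epsilon$ and $\delta$ are arbitrary, the right side can be made arbitrarily
small, and this proves the assertion.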


3. THE THEORY OF “FAIR” GAMES


For a further analysis of the implications of the law of large numbers 
we shall use the time-honored terminology of gamblers, but our dis- 
cussion bears equally on less frivolous applications, and our two basic 
assumptions are more realistic in statistics and physics than in gambling 
halls. First, we shall assume that our gambler possesses an unlimited 
capital so that no loss can force a termination of the game. (Dropping 
this assumption leads to the problem of the gambler’s ruin, which 
from the very beginning has intrigued students of probability. It is 
of importance in Wald’s sequential analysis and in the theory of sto- 
chastic processes, and will be taken up in chapter XIV.) Second, we 
shall assume that the gambler does not have the privilege of optional 
stopping; the number n of trials must be fixed in advance independently 
of the development of the game. In reality a player blessed with an 
unlimited capital would wait for a run of good luck and quit at an oppor- 
tune moment. He is not interested in the probable state at a pre- 
scribed moment, but only in the maximal fluctuations in the long run. 
Light is shed on this problem by the law of the iterated logarithm 
rather than by the law of large numbers (cf. chapter VIII, section 5). 
The random variable $X_k$ will be interpreted as the (positive or negative)
gain at the $k$th trial of a player who keeps playing the same type
of game of chance. The sum $S_n = X_1 + \cdots + X_n$ is the accumulated
gain in $n$ independent trials. If the player pays for each trial an
entrance fee $\mu'$ (not necessarily positive), then $n\mu'$ represents the
accumulated entrance fees, and $S_n - n\mu'$ the accumulated net gain. The
law of large numbers applies when $\mu = E(X_k)$ exists. It says roughly
that for sufficiently large $n$ the difference $S_n - n\mu$ is likely to be small
in comparison to $n$. Therefore, if the entrance fee $\mu'$ is smaller than $\mu$,
then, for large $n$, the player is likely to have a positive gain of the
order of magnitude $n(\mu - \mu')$. For the same reason an entrance fee
$\mu' > \mu$ is practically sure to lead to a loss. In short, the case $\mu' < \mu$
is favorable to the player, while $\mu' > \mu$ is unfavorable.

Note that nothing is said about the case $\mu' = \mu$. The only possible
conclusion in this case is that, for $n$ sufficiently large, the accumulated
gain or loss $S_n - n\mu$ will with overwhelming probability be small in
comparison with $n$. It is not stated whether $S_n - n\mu$ is likely to be
positive or negative, that is, whether the game is favorable or unfavorable.
This was overlooked in the classical theory which called $\mu' = \mu$ a
“fair” price and a game with $\mu' = \mu$ “fair.” Much harm was done by
the misleading suggestive power of this name. It must be understood
that a “fair” game may be distinctly favorable or unfavorable to the
player.

It is clear that “normally” not only $E(X_k)$ but also $\operatorname{Var}(X_k)$ exists.
In this case the law of large numbers is supplemented by the central
limit theorem, and the latter tells us that, with a “fair” game, the
long-run net gain $S_n - n\mu$ is likely to be of the order of magnitude $n^{1/2}$
and that for large $n$ there are about equal odds for this net gain to be
positive or negative. Thus, when the central limit theorem applies,
the term “fair” appears justified, but even in this case we deal with a
limit theorem with emphasis on the words “long run.”

For illustration, consider a slot machine where the player has a
probability of $10^{-6}$ to win $10^6 - 1$ dollars, and the alternative of losing the
entrance fee $\mu' = 1$. Here we have Bernoulli trials, and the game is
“fair.” In a million trials the player pays as many dollars in entrance
fees. He may hit the jackpot 0, 1, 2, ... times. We know from the
Poisson approximation to the binomial distribution that, with an
accuracy to several decimal places, the probability of hitting the jackpot
exactly $k$ times is $e^{-1}/k!$. Thus the player has probability 0.368...
to lose a million, and the same probability of barely recovering his
expenses; he has probability 0.184... to gain exactly one million, etc.
Here $10^6$ trials are equivalent to one single trial in a game with the
gain distributed according to a Poisson distribution (which could be
realized by matching two large decks of cards; cf. chapter IV, section 4).
Obviously the law of large numbers is operationally meaningless in such
situations. Now all fire, automobile, and similar insurance is of the
described type; the risk involves a huge sum, but the corresponding
probability is very small. Moreover, the insured plays ordinarily only
one trial per year, so that the number $n$ of trials never grows large.
For him the game is necessarily “unfair,” and yet it is usually
economically advantageous; the law of large numbers is of no relevance to him.
As for the company, it plays a large number of games, but because of
the large variance the chance fluctuations are pronounced. The
premiums must be fixed so as to preclude a huge loss in any specific year,
and hence the company is concerned with the ruin problem rather
than the law of large numbers.
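
The figures quoted for the slot machine are easily reproduced. The following sketch (function names and the table layout are our own) tabulates, for $k$ jackpots in $10^6$ trials, the net gain together with the exact binomial probability and its Poisson approximation $e^{-1}/k!$.

```python
import math

def binomial_pmf(n, p, k):
    """Exact probability of k successes in n Bernoulli trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(lam, k):
    """Poisson probability e^{-lam} lam^k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def jackpot_table(n=10**6, p=1e-6, prize=10**6 - 1, kmax=4):
    """Rows (k, net gain, exact binomial probability, Poisson approximation)."""
    rows = []
    for k in range(kmax + 1):
        net_gain = k * prize - (n - k)   # k wins of `prize`, n - k lost unit fees
        rows.append((k, net_gain, binomial_pmf(n, p, k), poisson_pmf(n * p, k)))
    return rows

if __name__ == "__main__":
    for row in jackpot_table():
        print(row)
```

With no jackpot the player is a million dollars down, one jackpot exactly recovers the fees, and two jackpots leave a gain of one million, as stated in the text.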

When the variance is infinite, the term “fair game” becomes an
absolute misnomer; there is no reason to believe that the accumulated
net gain $S_n - n\mu'$ fluctuates around zero. In fact, there exist examples
of “fair” games⁵ where the probability tends to one that the player
will have sustained a net loss. The law of large numbers asserts that
this net loss is likely to be of smaller order of magnitude than $n$.
However, nothing more can be asserted. If $a_n$ is an arbitrary sequence such
that $a_n/n \to 0$, it is possible to construct a “fair” game where the
probability tends to one that at the $n$th trial the accumulated net loss
exceeds $a_n$. Problem 15 contains an example where the player has a
practical assurance that his loss will exceed $n/\log n$. This game is
“fair,” and the entrance fee is unity. It is difficult to imagine that a
player will find it “fair” if he is practically sure to sustain a steadily
increasing loss.


*4. THE PETERSBURG GAME 


In the classical theory the notion of expectation was not clearly
dissociated from the definition of probability, and no mathematical
formalism existed to handle it. Random variables with infinite expectations
therefore produced insurmountable difficulties, and even quite
recent discussions appear strange to the student of modern probability.
The importance of variables without expectation has been stressed at
the conclusion of section 1, and it seems appropriate here to give an
example for the analogue of the law of large numbers in the case of
such variables. For that purpose we use the time-honored so-called
Petersburg paradox.⁶


5 W. Feller, Note on the law of large numbers and “fair” games, Annals of
Mathematical Statistics, vol. 16 (1945), pp. 301-304.
* Starred sections treat special topics and may be omitted at first reading.

A single trial in the Petersburg game consists in tossing a true coin
until it falls heads; if this occurs at the $r$th throw the player receives $2^r$
dollars. In other words, the gain at each trial is a random variable
assuming the values $2^1, 2^2, 2^3, \ldots$ with corresponding probabilities
$2^{-1}, 2^{-2}, 2^{-3}, \ldots$. The expectation is formally defined by $\sum x_r f(x_r)$
with $x_r = 2^r$ and $f(x_r) = 2^{-r}$, so that each term of the series equals 1.
Thus the gain has no finite expectation, and the law of large numbers
is inapplicable. Now the game becomes less favorable to the player
when amended by the rule that he receives nothing if a trial takes more
than $N$ tosses (i.e., if the coin falls tails $N$ times in succession). In
this amended game the gain has the finite expectation $N$, and the law
of large numbers applies. It follows that after $n$ trials the accumulated
gain is likely to exceed $nN$ for every $N$. The player can therefore
expect to have a net profit even if he pays an arbitrary fixed entrance fee
$\mu'$ for each trial. This is true for every $\mu'$, but the larger $\mu'$, the larger
must $n$ be in order that a positive gain be probable. The classical
theory concluded that $\mu' = \infty$ is a “fair” entrance fee, but the modern
student will hardly understand the mysterious discussions of this
“paradox.”
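
That the amended game has expectation exactly $N$ follows because each of the first $N$ terms $2^r \cdot 2^{-r}$ contributes 1. A minimal sketch with exact rational arithmetic (the function name is ours):

```python
from fractions import Fraction

def amended_expectation(N):
    """Expected gain when the payoff 2^r (probability 2^-r) is paid only for r <= N."""
    return sum(Fraction(2 ** r) * Fraction(1, 2 ** r) for r in range(1, N + 1))

if __name__ == "__main__":
    for N in (1, 5, 25):
        print(N, amended_expectation(N))
```

The partial sums grow without bound as $N$ increases, which is the formal content of the statement that the unamended gain has no finite expectation.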

It is perfectly possible to determine entrance fees with which the
Petersburg game will have all properties of a “fair” game in the classical
sense, except that these entrance fees will depend on the number of
trials instead of remaining constant. Variable entrance fees are
undesirable in gambling halls, but there the Petersburg game is impossible
anyway because of limited resources. In the case of a finite expectation
$\mu = E(X_k) > 0$, a game is called “fair” if for large $n$ the ratio of the
accumulated gain $S_n$ to the accumulated entrance fees $e_n = n\mu'$ is likely
to be near 1 (that is, if the difference $S_n - e_n$ is likely to be of smaller
order of magnitude than $e_n = n\mu'$). If $E(X_k)$ does not exist, we cannot
put $e_n = n\mu'$ but must determine $e_n$ in another way. We shall say
that a game with accumulated entrance fees $e_n$ is fair in the classical
sense if for every $\epsilon > 0$

(4.1)    $P\left\{\left|\dfrac{S_n}{e_n} - 1\right| > \epsilon\right\} \to 0.$


This is the complete analogue of the law of large numbers where
$e_n = n\mu'$. The latter is interpreted by the physicist to the effect that
the average of $n$ independent measurements is bound to be near $\mu$.
In the present instance the average of $n$ measurements is bound to be
near $e_n/n$. Our limit theorem (4.1), when it applies, has a mathematical
and operational meaning which does not differ from the law of large
numbers.


6 This paradox was discussed by Daniel Bernoulli (1700-1782). Note that
Bernoulli trials are named after James Bernoulli.

We shall now show⁷ that the Petersburg game becomes “fair” in the
classical sense if we put $e_n = n \operatorname{Log} n$, where $\operatorname{Log} n$ is the logarithm to
the base 2, that is, $2^{\operatorname{Log} n} = n$.


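Proof. We use the method of truncation of section 2, this time
defining the variables $U_k$ and $V_k$ ($k = 1, 2, \ldots, n$) by

(4.2)    $U_k = X_k, \quad V_k = 0 \quad$ if $X_k \le n \operatorname{Log} n$;
         $U_k = 0, \quad V_k = X_k \quad$ if $X_k > n \operatorname{Log} n$.

Again $X_k = U_k + V_k$, and the $U_k$ are mutually independent. For every
$t$ we have $P\{X_k > t\} < 2/t$ and hence $P\{V_k \ne 0\} < 2/(n \operatorname{Log} n)$, or

(4.3)    $P\{V_1 + V_2 + \cdots + V_n \ne 0\} < \dfrac{2}{\operatorname{Log} n} \to 0.$

To verify (4.1) it suffices therefore to prove that

(4.4)    $P\{|U_1 + \cdots + U_n - n \operatorname{Log} n| > \epsilon n \operatorname{Log} n\} \to 0.$

Put $\mu_n = E(U_k)$ and $\sigma_n^2 = \operatorname{Var}(U_k)$; these quantities depend on $n$,
but are common to $U_1, U_2, \ldots, U_n$. If $r$ is the largest integer such that
$2^r \le n \operatorname{Log} n$, then $\mu_n = r$ and hence for sufficiently large $n$

(4.5)    $\operatorname{Log} n < \mu_n \le \operatorname{Log} n + \operatorname{Log} \operatorname{Log} n.$

Similarly

(4.6)    $\sigma_n^2 < E(U_k^2) = 2 + 2^2 + \cdots + 2^r < 2^{r+1} \le 2n \operatorname{Log} n.$

Since the sum $U_1 + \cdots + U_n$ has mean $n\mu_n$ and variance $n\sigma_n^2$, we
have by Chebyshev's inequality

(4.7)    $P\{|U_1 + \cdots + U_n - n\mu_n| > \epsilon n \mu_n\} \le \dfrac{n\sigma_n^2}{\epsilon^2 n^2 \mu_n^2} < \dfrac{2}{\epsilon^2 \operatorname{Log} n} \to 0.$

Now by (4.5) $\mu_n \sim \operatorname{Log} n$, and hence (4.7) is equivalent to (4.4).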
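
The quantities entering the proof can be tabulated for specific $n$. The sketch below (names are ours) computes $r$, $\mu_n$ and $E(U_k^2)$ for the truncation level $n \operatorname{Log} n$ and evaluates the Chebyshev estimate with $\sigma_n^2$ replaced by the larger quantity $E(U_k^2)$; it is meant for $n \ge 2$ only.

```python
import math

def truncation_quantities(n):
    """Quantities of the proof for truncation level n*Log n (Log = logarithm to base 2)."""
    log_n = math.log2(n)
    level = n * log_n
    r = 0
    while 2 ** (r + 1) <= level:          # r = largest integer with 2^r <= n Log n
        r += 1
    mu_n = r                              # E(U_k) = sum over j <= r of 2^j * 2^-j
    second_moment = 2 ** (r + 1) - 2      # E(U_k^2) = 2 + 2^2 + ... + 2^r
    return {"log_n": log_n, "level": level, "r": r,
            "mu_n": mu_n, "second_moment": second_moment}

def chebyshev_rhs(n, eps):
    """n*E(U_k^2) / (eps*n*mu_n)^2, an upper estimate for the middle term of (4.7)."""
    q = truncation_quantities(n)
    return n * q["second_moment"] / (eps * n * q["mu_n"]) ** 2

if __name__ == "__main__":
    for n in (2**4, 2**10, 2**20):
        print(n, truncation_quantities(n), chebyshev_rhs(n, 0.5))
```

The estimate decreases only like $1/\operatorname{Log} n$, so the convergence asserted in (4.1) is slow.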


7 This is a special case of a generalized law of large numbers from which necessary 
and sufficient conditions for (4.1) can easily be derived; cf. W. Feller, Acta Scien- 
tiarum Litterarum Univ. Szeged, vol. 8 (1937), pp. 191-201. 
5. VARIABLE DISTRIBUTIONS


Up to now we have considered only the case where the variables $X_k$
have the same distribution. This situation corresponds to a repetition
of the same game of chance, but it is more interesting to see what
happens if the type of game changes at each step. It is not necessary to
think of gambling places; the statistician who applies statistical tests
is engaged in a dignified sort of gambling, and in his case the
distribution of the random variables changes from occasion to occasion.

To fix ideas we shall imagine that an infinite sequence of probability
distributions is given so that for each $n$ we have $n$ mutually independent
variables $X_1, \ldots, X_n$ with the prescribed distributions. We shall
assume that the means and variances exist and put

(5.1)    $\mu_k = E(X_k), \qquad \sigma_k^2 = \operatorname{Var}(X_k).$

The sum $S_n = X_1 + \cdots + X_n$ has also finite mean and variance

(5.2)    $m_n = E(S_n), \qquad s_n^2 = \operatorname{Var}(S_n)$

given by

(5.3)    $m_n = \mu_1 + \cdots + \mu_n, \qquad s_n^2 = \sigma_1^2 + \cdots + \sigma_n^2$

[cf. formulas IX(2.4) and IX(5.6)]. In the special case of identical
distributions we had $m_n = n\mu$, $s_n^2 = n\sigma^2$.


The (weak) law of large numbers is said to hold for the sequence $\{X_k\}$
if for every $\epsilon > 0$

(5.4)    $P\left\{\dfrac{|S_n - m_n|}{n} > \epsilon\right\} \to 0.$

The sequence $\{X_k\}$ is said to obey the central limit theorem if for every
fixed $\alpha < \beta$

(5.5)    $P\left\{\alpha < \dfrac{S_n - m_n}{s_n} < \beta\right\} \to \Phi(\beta) - \Phi(\alpha).$

It is one of the salient features of probability theory that both the
law of large numbers and the central limit theorem hold for a
surprisingly large class of sequences $\{X_k\}$. In particular, the law of large
numbers holds whenever the $X_k$ are uniformly bounded, that is, whenever
there exists a constant $A$ such that $|X_k| < A$ for all $k$. More generally,
a sufficient condition for the law of large numbers to hold is that

(5.6)    $\dfrac{s_n}{n} \to 0.$

This is a direct consequence of the Chebyshev inequality, and the
proof given in the opening passage of section 2 applies. Note, however,
that the condition (5.6) is not necessary (cf. problem 14).

Various sufficient conditions for the central limit theorem have been
discovered, but all were superseded by the Lindeberg⁸ theorem according
to which the central limit theorem holds whenever for every $\epsilon > 0$ the
truncated variables $U_k$ defined by

(5.7)    $U_k = X_k - \mu_k \quad$ if $|X_k - \mu_k| \le \epsilon s_n$,
         $U_k = 0 \quad$ if $|X_k - \mu_k| > \epsilon s_n$

satisfy the conditions $s_n \to \infty$ and

(5.8)    $\dfrac{1}{s_n^2} \sum_{k=1}^{n} E(U_k^2) \to 1.$

If the $X_k$ are uniformly bounded, that is, if $|X_k| < A$, then
$U_k = X_k - \mu_k$ for all $n$ which are so large that $s_n > 2A\epsilon^{-1}$. The left side
in (5.8) then equals 1. Therefore the Lindeberg theorem implies that
every uniformly bounded sequence $\{X_k\}$ of mutually independent random
variables obeys the central limit theorem, provided, of course, that
$s_n \to \infty$. It was found that the Lindeberg conditions are also
necessary for (5.5) to hold.⁹ The proof is deferred to the second volume,
where we shall also give estimates for the difference between the two
sides in (5.5).

In the case where the variables $X_k$ have a common distribution we
found the central limit theorem to be stronger than the law of large
numbers. This is not so in general, and we shall see that the central
limit theorem may apply to sequences which do not obey the law of
large numbers.


Examples. (a) Let $\lambda > 0$ be fixed, and let $X_k = \pm k^{\lambda}$, each with
probability 1/2 (e.g., a coin is tossed, and at the $k$th throw the stakes
are $\pm k^{\lambda}$). Here $\mu_k = 0$, $\sigma_k^2 = k^{2\lambda}$, and

(5.9)    $s_n^2 = 1^{2\lambda} + 2^{2\lambda} + \cdots + n^{2\lambda} \sim \dfrac{n^{2\lambda+1}}{2\lambda+1}.$


8 J. W. Lindeberg, loc. cit. (footnote 3).

9 W. Feller, Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung,
Mathematische Zeitschrift, vol. 40 (1935), pp. 521-559. There also a generalized
central limit theorem is derived which may apply to variables without expectations.
Note that we are here considering only independent variables; for dependent
variables the Lindeberg condition is neither necessary nor sufficient.

The condition (5.6) is satisfied if $\lambda < 1/2$. Therefore the law of large
numbers holds if $\lambda < 1/2$; we proceed to show that it does not hold if
$\lambda \ge 1/2$.

For $k = 1, 2, \ldots, n$ we have $|X_k| = k^{\lambda} \le n^{\lambda}$, so that for
$n > (2\lambda + 1)\epsilon^{-2}$ the truncated variables $U_k$ are identical with the $X_k$.
Hence the Lindeberg condition applies for $\lambda > 0$, and

(5.10)    $P\left\{\alpha < \left(\dfrac{2\lambda+1}{n^{2\lambda+1}}\right)^{1/2} S_n < \beta\right\} \to \Phi(\beta) - \Phi(\alpha).$

It follows that $S_n$ is likely to be of the order of magnitude $n^{\lambda + 1/2}$, so
that the law of large numbers cannot apply for $\lambda \ge 1/2$. We see that
in this example the central limit theorem applies for all $\lambda > 0$, but the
law of large numbers only if $\lambda < 1/2$.
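
The asymptotic relation (5.9) and the behavior of the ratio $s_n/n$ in condition (5.6) can be examined numerically. The following sketch uses function names of our own choosing and ordinary floating-point arithmetic.

```python
import math

def s_n_squared(n, lam):
    """Exact s_n^2 = 1^(2*lam) + 2^(2*lam) + ... + n^(2*lam)."""
    return sum(k ** (2.0 * lam) for k in range(1, n + 1))

def asymptotic_s_n_squared(n, lam):
    """Leading term n^(2*lam + 1) / (2*lam + 1) from (5.9)."""
    return n ** (2.0 * lam + 1.0) / (2.0 * lam + 1.0)

def s_n_over_n(n, lam):
    """The ratio s_n / n appearing in condition (5.6)."""
    return math.sqrt(s_n_squared(n, lam)) / n

if __name__ == "__main__":
    for lam in (0.25, 0.5, 1.0):
        print(lam, [round(s_n_over_n(n, lam), 4) for n in (10**2, 10**3, 10**4)])
```

For $\lambda = 1/4$ the ratio $s_n/n$ decreases toward zero, while for $\lambda = 1$ it grows, in agreement with the dividing value $\lambda = 1/2$.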

(b) Consider two independent sequences of 1000 tossings of a coin
(or emptying two bags of 1000 coins each), and let us examine the
difference $D$ of the number of heads. Let the tossings of the two
sequences be numbered from 1 to 1000 and from 1001 to 2000,
respectively, and define 2000 random variables $X_k$ as follows: If the $k$th
coin falls tails, then $X_k = 0$. If it falls heads, we put $X_k = 1$ for
$k \le 1000$ and $X_k = -1$ for $k > 1000$. Then $D = X_1 + X_2 + \cdots + X_{2000}$.
Moreover, $\mu_k = \pm 1/2$, depending on the sequence to which
the coin belongs, $\sigma_k^2 = 1/4$, $m_{2000} = 0$, $s_{2000}^2 = 500$. Therefore the
probability that the difference $D$ will lie within the limits $\pm (500)^{1/2}\alpha$ is
$\Phi(\alpha) - \Phi(-\alpha)$, approximately, and $D$ is comparable to the deviation
$S_{2000} - 1000$ of the number of heads in 2000 tossings from its expected
number 1000.
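
The approximation in example (b) can be compared with the exact probability. The sketch below relies on the observation that, by the symmetry of heads and tails in the second sequence, $D + 1000$ has the same distribution as the number of heads in 2000 tossings; the function names are ours.

```python
import math

def prob_difference_within(alpha, tosses_each=1000):
    """Exact P{|D| <= alpha * s} for D = (heads in first sequence) - (heads in second)."""
    total = 2 * tosses_each
    s = math.sqrt(total / 4)                      # s_2000 = sqrt(500) for the default
    favorable = sum(math.comb(total, j) for j in range(total + 1)
                    if abs(j - tosses_each) <= alpha * s)
    return favorable / 2 ** total

def normal_estimate(alpha):
    """Phi(alpha) - Phi(-alpha)."""
    return math.erf(alpha / math.sqrt(2.0))

if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0):
        print(alpha, prob_difference_within(alpha), normal_estimate(alpha))
```

The small discrepancies arise from the discreteness of $D$; no continuity correction is applied here.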

(c) An application to the theory of inheritance will illustrate the great
variety of conclusions based on the central limit theorem. In chapter
V, section 5, we have studied traits which depend essentially only on
one pair of genes (alleles). We conceive of other characters (like height)
as the cumulative effect of many pairs of genes. For simplicity,
suppose that for each particular pair of genes there exist three genotypes
AA, Aa, or aa. Let $x_1$, $x_2$, and $x_3$ be the corresponding contributions.
The genotype of an individual is a random event, and the contribution
of a particular pair of genes to the height is a random variable $X_k$
assuming the three values $x_1$, $x_2$, $x_3$ with certain probabilities. The
height is the cumulative effect of many such random variables $X_1, X_2,
\ldots, X_n$, and since the contribution of each is small, we may in first
approximation assume that the height is the sum $X_1 + \cdots + X_n$. It
is true that not all the $X_k$ are mutually independent. However, the
central limit theorem holds also for large classes of dependent variables,
and, besides, it is plausible that the great majority of the $X_k$ can be
treated as independent. These considerations can be rendered more
precise; here they serve only as indication of how the central limit
theorem explains why many biometric characters, like height, exhibit
an empirical distribution close to the normal distribution. This theory
permits also the prediction of properties of inheritance, e.g., the
dependence of the mean height of children on the height of their parents.
Such biometric investigations were initiated by F. Galton and Karl
Pearson.


*6. APPLICATIONS TO COMBINATORIAL ANALYSIS 


We shall give two examples of applications of the central limit
theorem to problems not directly connected with probability theory.
Both relate to the $n!$ permutations of the $n$ elements $a_1, a_2, \ldots, a_n$, to
each of which we attribute probability $1/n!$.


(a) Inversions 


In a given permutation the element $a_k$ is said to induce $r$ inversions
if it precedes exactly $r$ elements with smaller index (i.e., elements which
precede $a_k$ in the natural order). For example, in $(a_3 a_6 a_1 a_5 a_2 a_4)$ the
elements $a_1$ and $a_2$ induce no inversion, $a_3$ induces two, $a_4$ none, $a_5$
two, and $a_6$ four. In $(a_6 a_5 a_4 a_3 a_2 a_1)$ the element $a_k$ induces $k - 1$
inversions and there are fifteen inversions in all. The number $X_k$ of
inversions induced by $a_k$ is a random variable, and $S_n = X_1 + \cdots + X_n$ is
the total number of inversions. Here $X_k$ assumes the values $0, 1, \ldots,
k - 1$, each with probability $1/k$, and therefore

(6.1)    $\mu_k = \dfrac{k-1}{2}, \qquad \sigma_k^2 = \dfrac{1 + 2^2 + \cdots + (k-1)^2}{k} - \left(\dfrac{k-1}{2}\right)^2 = \dfrac{k^2 - 1}{12}.$

The number of inversions produced by $a_k$ does not depend on the
relative order of $a_1, a_2, \ldots, a_{k-1}$, and the $X_k$ are mutually independent.
From (6.1) we get

(6.2)    $m_n = \dfrac{1 + 2 + \cdots + (n-1)}{2} = \dfrac{n(n-1)}{4} \sim \dfrac{n^2}{4}$

and

(6.3)    $s_n^2 = \dfrac{1}{12} \sum_{k=1}^{n} (k^2 - 1) = \dfrac{2n^3 + 3n^2 - 5n}{72} \sim \dfrac{n^3}{36}.$


* This section treats a special topic and may be omitted.

For large $n$ we have $\epsilon s_n > n > U_k$, and hence the variables $U_k$ of the
Lindeberg condition are identical with $X_k$. Therefore the central limit
theorem applies, and we conclude that the number $N_n$ of permutations
for which the number of inversions lies between the limits
$\dfrac{n^2}{4} \pm \dfrac{\alpha}{6} n^{3/2}$ is,
asymptotically, given by $n!\{\Phi(\alpha) - \Phi(-\alpha)\}$. In particular, for about
one-half of all permutations the number of inversions lies between the
limits $(n^2/4) \pm (0.11)n^{3/2}$.
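
The counting of inversions and the formulas (6.2) and (6.3) can be verified by direct enumeration for small $n$. In the sketch below a permutation is written as the tuple of indices in their order of appearance; the function names are ours, and the enumeration is feasible only for small $n$.

```python
from fractions import Fraction
from itertools import permutations

def induced_inversions(perm):
    """Map each index k to the number of smaller indices that a_k precedes in perm."""
    result = {}
    for pos, k in enumerate(perm):
        result[k] = sum(1 for later in perm[pos + 1:] if later < k)
    return result

def inversion_moments(n):
    """Exact mean and variance of the total number of inversions over all n! permutations."""
    totals = [sum(induced_inversions(p).values())
              for p in permutations(range(1, n + 1))]
    mean = Fraction(sum(totals), len(totals))
    var = Fraction(sum(t * t for t in totals), len(totals)) - mean ** 2
    return mean, var

if __name__ == "__main__":
    print(induced_inversions((3, 6, 1, 5, 2, 4)))
    print(inversion_moments(5))
```

For the first example of the text the function reports two inversions for $a_3$, four for $a_6$, two for $a_5$, and none for the remaining elements.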


(b) Cycles 


Every permutation can be broken down into cycles, that is, groups
of elements permuted among themselves. Thus in $(a_3 a_6 a_1 a_5 a_2 a_4)$ we
find that $a_1$ and $a_3$ are interchanged, and that the remaining four
elements are permuted among themselves; this permutation contains two
cycles. If an element is in its natural place, it forms a cycle so that
the identical permutation $(a_1, a_2, \ldots, a_n)$ contains as many cycles as
elements. On the other hand, the cyclical permutations $(a_2, a_3, \ldots, a_n,
a_1)$, $(a_3, a_4, \ldots, a_n, a_1, a_2)$ etc. contain a single cycle each. For the
study of cycles it is convenient to describe the permutation by means
of arrows indicating the places occupied by the several elements. For
example, 1→3→4→1 indicates that $a_1$ is at the third place, $a_3$ at the
fourth, and $a_4$ at the first, the third step thus completing the cycle.
This description continues with $a_2$, which is the next element in the
natural order. In this notation the permutation $(a_4, a_8, a_1, a_3, a_2, a_5, a_7, a_6)$
is described by: 1→3→4→1; 2→5→6→8→2; 7→7.
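
The arrow description lends itself to a direct computation. The following sketch (names are ours) takes a permutation as the tuple of indices occupying places 1, 2, ..., n, follows the arrows starting from the smallest unused index, and records for each step whether it closes a cycle.

```python
def cycle_buildup(perm):
    """Return (cycles, closes): the cycles in arrow order and a 0/1 flag per step."""
    n = len(perm)
    place_of = {elem: place for place, elem in enumerate(perm, start=1)}
    seen = set()
    cycles, closes = [], []
    for start in range(1, n + 1):
        if start in seen:
            continue
        cycle, cur = [], start
        while True:
            cycle.append(cur)
            seen.add(cur)
            nxt = place_of[cur]       # a_cur sits at place nxt: arrow cur -> nxt
            if nxt == start:
                closes.append(1)      # this step completes the cycle
                break
            closes.append(0)
            cur = nxt
        cycles.append(cycle)
    return cycles, closes

if __name__ == "__main__":
    print(cycle_buildup((4, 8, 1, 3, 2, 5, 7, 6)))
```

For the permutation $(a_4, a_8, a_1, a_3, a_2, a_5, a_7, a_6)$ the cycles come out as 1, 3, 4; then 2, 5, 6, 8; then 7, and the cycles are completed at the third, seventh, and eighth steps.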

Let $X_k$ equal 1 if a cycle is completed at the $k$th step in this build-up;
otherwise let $X_k = 0$. (In the last example $X_3 = X_7 = X_8 = 1$ and
$X_1 = X_2 = X_4 = X_5 = X_6 = 0$.) Clearly $X_1 = 1$ if, and only if, $a_1$ is
at the first place. At the step number $1, 2, \ldots, n$ we have $n, n-1, \ldots, 1$
choices, respectively, and among them just one leads to the completion
of a cycle. Therefore¹⁰ $X_k = 1$ with probability $1/(n - k + 1)$ and
$X_k = 0$ with probability $(n - k)/(n - k + 1)$. The variables $X_k$ are
mutually independent with means and variances

(6.4)    $\mu_k = \dfrac{1}{n-k+1}, \qquad \sigma_k^2 = \dfrac{n-k}{(n-k+1)^2},$

whence

(6.5)    $m_n = 1 + \dfrac{1}{2} + \dfrac{1}{3} + \cdots + \dfrac{1}{n} \sim \log n$


10 Formally, the distribution of $X_k$ depends not only on $k$ but also on $n$. It suffices
to reorder the $X_k$, starting from $k = n$ down to $k = 1$, to have the distribution
depend only on the subscript.
and

(6.6)    $s_n^2 = \sum_{k=1}^{n} \dfrac{n-k}{(n-k+1)^2} \sim \log n.$

$S_n = X_1 + \cdots + X_n$ is the total number of cycles. The average is $m_n$;
the number of permutations with cycles between $\log n + \alpha(\log n)^{1/2}$ and
$\log n + \beta(\log n)^{1/2}$ is given by $n!\{\Phi(\beta) - \Phi(\alpha)\}$, approximately. The
refined forms of the central limit theorem give more precise estimates.¹¹


*7. THE STRONG LAW OF LARGE NUMBERS


The (weak) law of large numbers (5.4) asserts that for every
particular sufficiently large $n$ the deviation $|S_n - m_n|$ is likely to be small
in comparison to $n$. It has been pointed out in connection with
Bernoulli trials (chapter VIII) that this does not imply that $|S_n - m_n|/n$
remains small for all large $n$; it can happen that the law of large
numbers applies but that $|S_n - m_n|/n$ continues to fluctuate between
finite or infinite limits. The law of large numbers permits only the
conclusion that large values of $|S_n - m_n|/n$ occur at infrequent
moments.

We say that the sequence $X_k$ obeys the strong law of large numbers if to
every pair $\epsilon > 0$, $\delta > 0$, there corresponds an $N$ such that there is
probability $1 - \delta$ or better that for every $r > 0$ all $r + 1$ inequalities

(7.1)    $\dfrac{|S_n - m_n|}{n} < \epsilon, \qquad n = N, N+1, \ldots, N+r$

will be satisfied.

We can interpret (7.1) roughly by saying that with an overwhelming
probability $|S_n - m_n|/n$ remains small¹² for all $n > N$.


The Kolmogorov Criterion. The convergence of the series

(7.2)    $\sum \dfrac{\sigma_k^2}{k^2}$

is a sufficient condition for the strong law of large numbers to apply to the
sequence of mutually independent random variables $X_k$ with variances $\sigma_k^2$.


11 A great variety of asymptotic estimates in combinatorial analysis were
derived by other methods by V. Gončarov, Du domaine d'analyse combinatoire,
Bulletin de l'Académie Sciences URSS, Sér. Math. (in Russian, French summary),
vol. 8 (1944), pp. 3-48. The present method is simpler but more restricted in scope;
cf. W. Feller, The fundamental limit theorems in probability, Bulletin of the
American Mathematical Society, vol. 51 (1945), pp. 800-832.

* This section treats a special topic and may be omitted.

12 The general theory introduces a sample space corresponding to the infinite
sequence $\{X_k\}$. The strong law then states that with probability one $|S_n - m_n|/n$
tends to zero. In real variable terminology the strong law asserts convergence
almost everywhere, and the weak law is equivalent to convergence in measure.


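Proof. Let $A_\nu$ be the event that for at least one $n$ with $2^{\nu-1} < n \le 2^{\nu}$
the inequality (7.1) does not hold. Obviously it suffices to prove that
for all $\nu$ sufficiently large ($\nu > \log N$) and all $r$

         $P\{A_\nu\} + P\{A_{\nu+1}\} + \cdots + P\{A_{\nu+r}\} < \delta,$

that is, that the series $\sum P\{A_\nu\}$ converges. Now the event $A_\nu$ implies
that for some $n$ with $2^{\nu-1} < n \le 2^{\nu}$

(7.3)    $|S_n - m_n| \ge \epsilon \cdot 2^{\nu-1}$

and by Kolmogorov's inequality (chapter IX, section 7)

(7.4)    $P\{A_\nu\} \le 4\epsilon^{-2}\, s_{2^\nu}^2\, 2^{-2\nu}.$

Hence

(7.5)    $\sum_{\nu=1}^{\infty} P\{A_\nu\} \le 4\epsilon^{-2} \sum_{\nu=1}^{\infty} 2^{-2\nu} \sum_{k=1}^{2^\nu} \sigma_k^2 = 4\epsilon^{-2} \sum_{k=1}^{\infty} \sigma_k^2 \sum_{2^\nu \ge k} 2^{-2\nu} \le 8\epsilon^{-2} \sum_{k=1}^{\infty} \dfrac{\sigma_k^2}{k^2},$

which accomplishes the proof.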
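
The last step in (7.5) rests on the estimate $\sum_{2^\nu \ge k} 2^{-2\nu} \le 2/k^2$ for the inner geometric sum. A small sketch with exact rational arithmetic (the function name is ours) confirms this for a range of $k$:

```python
from fractions import Fraction

def tail_sum(k):
    """Exact value of the sum of 2^(-2*nu) over all nu >= 1 with 2^nu >= k."""
    nu0 = 1
    while 2 ** nu0 < k:
        nu0 += 1
    # geometric series: sum over nu >= nu0 of 4^(-nu) equals (4/3) * 4^(-nu0)
    return Fraction(4, 3) * Fraction(1, 4 ** nu0)

if __name__ == "__main__":
    for k in (1, 2, 3, 100):
        print(k, tail_sum(k), Fraction(2, k * k))
```

Since $2^{\nu_0} \ge k$ for the first admissible index $\nu_0$, the sum never exceeds $(4/3)k^{-2}$, so the constant 8 in (7.5) is not the best possible.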
As a typical application we prove the 


Theorem. If the mutually independent random variables $X_k$ have a
common distribution $\{f(x_j)\}$ and if $\mu = E(X_k)$ exists, then the strong law
of large numbers applies to the sequence $\{X_k\}$.


This theorem is, of course, stronger than the weak law of section 1.
The two theorems are treated independently because of the
methodological interest of the proofs. For a converse cf. problem 17.

Proof. We again use the method of truncation. Two new sequences
of random variables are introduced by

(7.6)    $U_k = X_k, \quad V_k = 0 \quad$ if $|X_k| < k$,
         $U_k = 0, \quad V_k = X_k \quad$ if $|X_k| \ge k$.

The $U_k$ are mutually independent, and we shall show that they satisfy
Kolmogorov's criterion. Clearly

(7.7)    $\sigma_k^2 \le E(U_k^2) = \sum_{|x_j| < k} x_j^2 f(x_j).$

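Put for abbreviation

(7.8)    $a_\nu = \sum_{\nu - 1 \le |x_j| < \nu} |x_j|\, f(x_j).$

Then the series $\sum a_\nu$ converges since $E(X_k)$ exists. Moreover, from (7.7),

(7.9)    $\sigma_k^2 \le a_1 + 2a_2 + 3a_3 + \cdots + k a_k$

and

(7.10)    $\sum_{k=1}^{\infty} \dfrac{\sigma_k^2}{k^2} \le \sum_{k=1}^{\infty} \dfrac{1}{k^2} \sum_{\nu=1}^{k} \nu a_\nu = \sum_{\nu=1}^{\infty} \nu a_\nu \sum_{k=\nu}^{\infty} \dfrac{1}{k^2} < 2 \sum_{\nu=1}^{\infty} a_\nu < \infty.$

Finally

(7.11)    $E(U_k) = \mu_k = \sum_{|x_j| < k} x_j f(x_j)$

so that $\mu_k \to \mu$ and hence $(\mu_1 + \mu_2 + \cdots + \mu_n)/n \to \mu$. Applying
the strong law of large numbers to $\{U_k\}$, we conclude that with
probability $1 - \delta$ or better

(7.12)    $\left| n^{-1} \sum_{k=1}^{n} U_k - \mu \right| < \epsilon$

for all $n > N$. It suffices now to prove that the $V_n$ can be neglected,
that is, that the probability of one or more $V_n$ with $n > N$ being
different from zero tends to 0 with $N \to \infty$. The first Borel-Cantelli
lemma (chapter VIII, section 3) applies with obvious verbal changes,
and it suffices to prove that $\sum P\{V_n \ne 0\}$ converges. Now

(7.13)    $P\{V_n \ne 0\} = \sum_{|x_j| \ge n} f(x_j) \le \dfrac{a_{n+1}}{n} + \dfrac{a_{n+2}}{n+1} + \dfrac{a_{n+3}}{n+2} + \cdots$

and hence

(7.14)    $\sum_{n=1}^{\infty} P\{V_n \ne 0\} \le \sum_{n=1}^{\infty} \sum_{\nu=n}^{\infty} \dfrac{a_{\nu+1}}{\nu} = \sum_{\nu=1}^{\infty} \dfrac{a_{\nu+1}}{\nu} \sum_{n=1}^{\nu} 1 = \sum_{\nu=1}^{\infty} a_{\nu+1} < \infty,$

as asserted.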


8. PROBLEMS FOR SOLUTION 


1. Prove that the law of large numbers applies in example (5.a) also when
$\lambda \le 0$. The central limit theorem holds if $\lambda \ge -1/2$.


2. Decide whether the law of large numbers and the central limit theorem
hold for the sequences of mutually independent variables $X_k$ with distributions
defined as follows ($k \ge 1$):


(a) $P\{X_k = \pm 2^k\} = 1/2$;
(b) $P\{X_k = \pm 2^k\} = 2^{-(2k+1)}$, $\quad P\{X_k = 0\} = 1 - 2^{-2k}$;
(c) $P\{X_k = \pm k\} = \tfrac{1}{2} k^{-1/2}$, $\quad P\{X_k = 0\} = 1 - k^{-1/2}$.
3. Ljapunov's condition (1901). Suppose that for some fixed $\delta > 0$ we have
$E(|X_k|^{2+\delta}) = \lambda_k$ where $\lambda_k/\sigma_k^2 <$ const. Show that Lindeberg's conditions are
satisfied if $s_n \to \infty$.

4, Find sufficient conditions on {Z,} for the weak law of large numbers 
and/or the central limit theorem to hold for the mutually independent vari- 
ables {X,}, where {X;} assumes the values 


1 2 k 
0, = oi + op ™ rae = oe 1 


each with probability 1/(2k + 1). 
5. Do the same problem if $X_k$ assumes the values $a_k$, $-a_k$, and 0 with
probabilities $p_k$, $p_k$, and $1 - 2p_k$.


Note: The following seven problems treat the weak law of large numbers for dependent 
variables. 


6. In problem V, 13 let $X_k = 1$ if the $k$th throw results in red, and $X_k = 0$
otherwise. Show that the law of large numbers does not apply.


7. Let the $\{X_k\}$ be mutually independent and have a common distribution
with mean $\mu$ and finite variance. If $S_n = X_1 + \cdots + X_n$, prove that the law
of large numbers does not hold for the sequence $\{S_n\}$ but holds for $a_n S_n$ if
$n a_n \to 0$.

8. Let $\{X_k\}$ be a sequence of random variables such that $X_k$ may depend on
$X_{k-1}$ and $X_{k+1}$ but is independent of all other $X_j$. Show that the law of large
numbers holds, provided the $X_k$ have bounded variances.


9. If the joint distribution of $(X_1, \ldots, X_n)$ is defined for every $n$ so that the
variances are bounded and all covariances are negative, the law of large numbers
applies.

10. Continuation. Replace the condition $\operatorname{Cov}(X_j, X_k) \le 0$ by the
assumption that $\operatorname{Cov}(X_j, X_k) \to 0$ uniformly as $|j - k| \to \infty$. Prove that the law
of large numbers holds.

11. If {S,| < en and Var(S,,) > an’, then the law of large numbers does not 
apply to {Χμ}. 

12. In the Polya urn scheme [example V(2.c)] let X, equal 1 or 0 according 
to whether the kth ball drawn is black or red. Then S, is the number of 
black balls in n drawings. Prove that the law of large numbers does not apply 
to {X;,}. (Hint: Use problems 11 and IX, 30.) 


13. The mutually independent random variables $X_k$ assume the values
$r = 2, 3, 4, \dots$ with probability $p_r = c/(r^2 \log r)$ where c is a constant such that
$\sum p_r = 1$. Show that the generalized law of large numbers (4.1) holds if we
put $e_n = c \cdot n \log\log n$.

14. Let $\{X_n\}$ be a sequence of mutually independent random variables
such that $X_n = \pm 1$ with probability $(1 - 2^{-n})/2$ and $X_n = \pm 2^n$ with
probability $2^{-n-1}$. Prove that both the weak and the strong law of large numbers
apply to $\{X_k\}$. (Note: This shows that the condition (5.6) is not necessary.)


15. Example of an unfavorable "fair" game. Let the possible values of the
gain at each trial be $0, 2, 2^2, 2^3, \dots$; the probability of the gain being $2^k$ is

(8.1)    $p_k = \dfrac{1}{2^k\, k(k+1)}$,

and the probability of 0 is $p_0 = 1 - (p_1 + p_2 + \dots)$. The expected gain is

(8.2)    $\mu = \sum 2^k p_k = (1 - \tfrac12) + (\tfrac12 - \tfrac13) + (\tfrac13 - \tfrac14) + \dots = 1$.


Assume that at each trial the player pays a unit amount as entrance fee, so
that after n trials his net gain (or loss) is $S_n - n$. Show that for every $\varepsilon > 0$
the probability approaches unity that in n trials the player will have sustained a
loss greater than $(1 - \varepsilon)n/\mathrm{Log}_2\, n$, where $\mathrm{Log}_2\, n$ denotes the logarithm to the
base 2. In symbols, prove that

(8.3)    $P\left\{ S_n - n < -\dfrac{(1 - \varepsilon)n}{\mathrm{Log}_2\, n} \right\} \to 1$.


Hint: Use the truncation method of section 4, but replace the bound n Log n
of (4.2) by $n/\mathrm{Log}_2\, n$. Show that the probability that $U_k = X_k$ for all $k \le n$
tends to 1 and prove that

(8.4)    $P\left\{ |U_1 + \dots + U_n - nE(U_1)| < \dfrac{\varepsilon n}{\mathrm{Log}_2\, n} \right\} \to 1$,

(8.5)    $1 - \dfrac{1}{\mathrm{Log}_2\, n} \ge E(U_1) \ge 1 - \dfrac{1 + \varepsilon}{\mathrm{Log}_2\, n}$.


For details see the paper cited in footnote 5. 


16. Let $\{X_n\}$ be a sequence of mutually independent random variables with
a common distribution. Suppose that the $X_n$ do not have a finite expectation
and let A be a positive constant. The probability is one that infinitely many
among the events $|X_n| > An$ occur.


17. Converse to the strong law of large numbers. Under the assumption of
problem 16 there is probability one that $|S_n| > An$ for infinitely many n.

18. A converse to Kolmogorov's criterion. If $\sum \sigma_k^2/k^2$ diverges, then there exists
a sequence $\{X_k\}$ of mutually independent random variables with $\mathrm{Var}\{X_k\} = \sigma_k^2$
for which the strong law of large numbers does not apply. (Hint: Prove first
that the convergence of $\sum P\{|X_n| > \varepsilon n\}$ is a necessary condition for the strong
law to apply.)




CHAPTER XI 


Integral Valued Variables. 


Generating Functions 


1. GENERALITIES 


Among discrete random variables those assuming only the integral 
values k = 0, 1, 2, ... are of special importance. Their study is 
facilitated by the powerful method of generating functions which will 
later be recognized as a special case of the method of characteristic 
functions on which the theory of probability depends to a large extent. 
More generally, the subject of generating functions belongs to the 
domain of operational methods which are widely used in the theory of 
differential and integral equations. In the theory of probability gener- 
ating functions have been used since DeMoivre and Laplace, but the 
power and the possibilities of the method are rarely fully utilized. 


Definition. Let $a_0, a_1, a_2, \dots$ be a sequence of real numbers. If

(1.1)    $A(s) = a_0 + a_1 s + a_2 s^2 + \dots$

converges in some interval $-s_0 < s < s_0$, then $A(s)$ is called the generating
function of the sequence $\{a_j\}$.


The variable s itself has no significance. If the sequence $\{a_j\}$ is
bounded, then a comparison with the geometric series shows that (1.1)
converges at least for $|s| < 1$.


Examples. If $a_j = 1$ for all j, then $A(s) = 1/(1 - s)$. The generating
function of the sequence (0, 0, 1, 1, 1, ...) is $s^2/(1 - s)$. The
sequence $a_j = 1/j!$ has the generating function $e^s$. For fixed n the
sequence $a_j = \binom{n}{j}$ has the generating function $(1 + s)^n$. If X is the
number scored in a throw of a perfect die, the probability distribution
of X has the generating function $(s + s^2 + s^3 + s^4 + s^5 + s^6)/6$.

Let X be a random variable assuming the values 0, 1, 2, .... It
will be convenient to have a notation both for the distribution of X
and for its tails, and we shall write

(1.2)    $P\{X = j\} = p_j$,    $P\{X > j\} = q_j$.

Then

(1.3)    $q_k = p_{k+1} + p_{k+2} + \dots$,    $k \ge 0$.

The generating functions of the sequences $\{p_j\}$ and $\{q_k\}$ are

(1.4)    $P(s) = p_0 + p_1 s + p_2 s^2 + p_3 s^3 + \dots$

(1.5)    $Q(s) = q_0 + q_1 s + q_2 s^2 + q_3 s^3 + \dots$.


As $P(1) = 1$, the series for $P(s)$ converges absolutely at least for
$-1 \le s \le 1$. The coefficients of $Q(s)$ are less than unity, and so the
series for $Q(s)$ converges at least in the open interval $-1 < s < 1$.


Theorem 1. For $-1 < s < 1$ we have

(1.6)    $Q(s) = \dfrac{1 - P(s)}{1 - s}$.

Proof. The coefficient of $s^n$ in $(1 - s) \cdot Q(s)$ equals $q_n - q_{n-1} = -p_n$
when $n \ge 1$, and equals $q_0 = p_1 + p_2 + \dots = 1 - p_0$ when n = 0.
Therefore $(1 - s) \cdot Q(s) = 1 - P(s)$ as asserted.


Next we examine the derivative

(1.7)    $P'(s) = \sum_{k=1}^{\infty} k p_k s^{k-1}$.


The series converges at least for $-1 < s < 1$. For s = 1 the right
side reduces formally to $\sum k p_k = E(X)$. Whenever this expectation
exists, the derivative $P'(s)$ will be continuous in the closed interval
$-1 \le s \le 1$. If $\sum k p_k$ diverges, then $P'(s) \to \infty$ as $s \to 1$. In this case
we say that X has an infinite expectation and write $P'(1) = E(X) = \infty$.
(All quantities being positive, there is no danger in the use of the
symbol $\infty$.) Applying the mean value theorem to the right side in (1.6),
we see that $Q(s) = P'(\sigma)$ where $\sigma$ is a point lying between s and 1. The
function $Q(s)$ increases monotonically as $s \to 1$, and so $Q(s) \to E(X)$
(finite or infinite). This proves


Theorem 2. For E(X) we have the two expressions

(1.8)    $E(X) = \sum_{j=1}^{\infty} j p_j = \sum_{k=0}^{\infty} q_k$.

In terms of the generating functions

(1.9)    $E(X) = P'(1) = Q(1)$.


By differentiation of (1.7) and of the relation $P'(s) = Q(s) - (1 - s)Q'(s)$
we find in the same way

(1.10)    $E(X(X - 1)) = \sum k(k - 1)p_k = P''(1) = 2Q'(1)$.


To obtain the variance of X we have to add $E(X) - E^2(X)$ which
leads us to


Theorem 3. We have

(1.11)    $\mathrm{Var}(X) = P''(1) + P'(1) - P'^2(1) = 2Q'(1) + Q(1) - Q^2(1)$.

In the case of an infinite variance $P''(s) \to \infty$ as $s \to 1$.


Frequently the formulas (1.9) and (1.11) provide the simplest means 
to calculate E(X) and Var(X). 
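
As an illustration of (1.8) to (1.11), the following minimal sketch computes the mean and variance of the perfect die of the examples above, once from the probabilities $p_j$ and once from the tails $q_k$. The function names `pgf_moments` and `tails` are hypothetical helpers introduced only for this example, and exact rational arithmetic is used so that the two routes can be compared without rounding.

```python
from fractions import Fraction

def pgf_moments(p):
    """Mean and variance of a finite distribution p[0..m] via (1.9)-(1.11)."""
    mean = sum(k * pk for k, pk in enumerate(p))              # P'(1)
    fact2 = sum(k * (k - 1) * pk for k, pk in enumerate(p))   # P''(1)
    return mean, fact2 + mean - mean ** 2

def tails(p):
    """q_k = P{X > k} = p_{k+1} + p_{k+2} + ..., as in (1.3)."""
    return [sum(p[k + 1:], Fraction(0)) for k in range(len(p))]

# Perfect die: p_0 = 0, p_1 = ... = p_6 = 1/6.
die = [Fraction(0)] + [Fraction(1, 6)] * 6
q = tails(die)

mean, var = pgf_moments(die)
q_sum = sum(q)                                     # Q(1)
q_prime = sum(k * qk for k, qk in enumerate(q))    # Q'(1)
var_from_tails = 2 * q_prime + q_sum - q_sum ** 2  # second form of (1.11)

print(mean, q_sum, var, var_from_tails)
```

Both routes give the mean 7/2 and the variance 35/12, in agreement with Theorems 2 and 3.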


2. CONVOLUTIONS 
Let X and Y be non-negative independent integral-valued random
variables with probability distributions $P\{X = j\} = a_j$ and
$P\{Y = j\} = b_j$. The event (X = j, Y = k) has probability $a_j b_k$.
The sum S = X + Y is a new random variable, and the event S = r
is the union of the mutually exclusive events


(X=0, Y=r), (X=1, Y=r-1), (X=2, Y=r-2), ..., (X=r, Y=0).

Therefore the distribution $c_r = P\{S = r\}$ is given by

(2.1)    $c_r = a_0 b_r + a_1 b_{r-1} + a_2 b_{r-2} + \dots + a_{r-1} b_1 + a_r b_0$.


The operation (2.1), leading from the two sequences $\{a_k\}$ and $\{b_k\}$
to a new sequence $\{c_k\}$, occurs so frequently that it is convenient to
introduce a special name and notation for it.


Definition. Let $\{a_k\}$ and $\{b_k\}$ be any two number sequences (not
necessarily probability distributions). The new sequence $\{c_k\}$ defined by
(2.1) is called the convolution [1] of $\{a_k\}$ and $\{b_k\}$ and will be denoted by


(2.2)    $\{c_k\} = \{a_k\} * \{b_k\}$.


[1] Some writers prefer the German word Faltung. The French equivalent is
composition.

Examples. (a) If $a_k = b_k = 1$ for all $k \ge 0$, then $c_k = k + 1$. If
$a_k = k$, $b_k = 1$, then $c_k = 1 + 2 + \dots + k = k(k + 1)/2$. Finally, if
$a_0 = a_1 = \tfrac12$, $a_k = 0$ for $k \ge 2$, then $c_k = (b_k + b_{k-1})/2$, etc.


The sequences $\{a_k\}$ and $\{b_k\}$ have generating functions $A(s) = \sum a_k s^k$
and $B(s) = \sum b_k s^k$. The product A(s)B(s) can be obtained by termwise
multiplication of the power series for A(s) and B(s). Collecting terms
with equal powers of s, we find that the coefficient $c_r$ of $s^r$ in the
expansion of A(s)B(s) is given by (2.1). We have thus the


Theorem. If $\{a_k\}$ and $\{b_k\}$ are sequences with generating functions
A(s) and B(s), and $\{c_k\}$ is their convolution, then the generating function
$C(s) = \sum c_k s^k$ is the product


(2.3) C(s) = A(s)B(s). 


If X and Y are non-negative integral-valued mutually independent random 
variables with generating functions A(s) and B(s), then their sum X + Y 
has the generating function A(s)B(s). 
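
The identity between convolution and multiplication of power series can be checked numerically. The sketch below is only an illustration: `convolve` is a hypothetical helper that multiplies two finite coefficient lists term by term, and the two infinite sequences of example (a) are truncated after n terms, which leaves the first n terms of the convolution exact.

```python
def convolve(a, b):
    """Convolution (2.1): c_r = a_0 b_r + a_1 b_{r-1} + ... + a_r b_0.

    For finite lists this is exactly the coefficient list of A(s)B(s).
    """
    c = [0] * (len(a) + len(b) - 1)
    for j, aj in enumerate(a):
        for k, bk in enumerate(b):
            c[j + k] += aj * bk
    return c

n = 6
ones = [1] * n            # a_k = 1
ramp = list(range(n))     # a_k = k

# Only the first n terms are kept; later terms are affected by truncation.
c1 = convolve(ones, ones)[:n]   # k + 1
c2 = convolve(ramp, ones)[:n]   # k(k + 1)/2

print(c1)
print(c2)
```

The two printed lists reproduce $c_k = k + 1$ and $c_k = k(k+1)/2$ from example (a).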


Let now $\{a_k\}$, $\{b_k\}$, $\{c_k\}$, $\{d_k\}$, ... be any sequences. We can form
the convolution $\{a_k\} * \{b_k\}$, and then the convolution of this new
sequence with $\{c_k\}$, etc. The generating function of $\{a_k\} * \{b_k\} * \{c_k\} * \{d_k\}$
is A(s)B(s)C(s)D(s), and this fact shows that the order in which the
convolutions are performed is immaterial. For example,
$\{a_k\} * \{b_k\} * \{c_k\} = \{c_k\} * \{b_k\} * \{a_k\}$, etc. Thus the convolution is an
associative and commutative operation (exactly as the summation of random variables).

In the study of sums of independent random variables $X_n$, the special
case where the $X_n$ have a common distribution is of particular interest.
If $\{a_j\}$ is the common probability distribution of the $X_n$, then the
distribution of $S_n = X_1 + \dots + X_n$ will be denoted by $\{a_j\}^{n*}$. Thus


(2.4)    $\{a_j\}^{2*} = \{a_j\} * \{a_j\}$,    $\{a_j\}^{3*} = \{a_j\}^{2*} * \{a_j\}$,

and generally

(2.5)    $\{a_j\}^{n*} = \{a_j\}^{(n-1)*} * \{a_j\}$.

In words, $\{a_j\}^{n*}$ is the sequence of numbers whose generating function is
$A^n(s)$. In particular, $\{a_j\}^{1*}$ is the same as $\{a_j\}$, and $\{a_j\}^{0*}$ is defined
as the sequence whose generating function is $A^0(s) = 1$, that is, the
sequence (1, 0, 0, 0, ...).

Examples. (b) Binomial distribution. The generating function of
the binomial distribution $b(k; n, p) = \binom{n}{k} p^k q^{n-k}$ is


(2.6)    $\sum_{k=0}^{n} \binom{n}{k} (ps)^k q^{n-k} = (q + ps)^n$.


The fact that this generating function is the nth power of q + ps shows
that $\{b(k; n, p)\}$ is the distribution of a sum $S_n = X_1 + \dots + X_n$ of n
independent random variables with the common generating function
q + ps; each variable $X_j$ assumes the value 0 with probability q and the
value 1 with probability p. Thus


(2.7)    $\{b(k; n, p)\} = \{b(k; 1, p)\}^{n*}$.


The representation $S_n = X_1 + \dots + X_n$ has already been used [e.g.,
in examples IX(3.a) and IX(5.a)]. The preceding argument may be
reversed to obtain a new derivation of the binomial distribution. The
multiplicative property $(q + ps)^m (q + ps)^n = (q + ps)^{m+n}$ shows also
that


(2.8)    $\{b(k; m, p)\} * \{b(k; n, p)\} = \{b(k; m+n, p)\}$


which is the same as formula VI(10.4). Differentiation of $(q + ps)^n$
leads also to a simple proof that $E(S_n) = np$ and $\mathrm{Var}(S_n) = npq$.

(c) Poisson distribution. The generating function of the distribution
$p(k; \lambda) = e^{-\lambda}\lambda^k/k!$ is


(2.9)    $\sum_{k=0}^{\infty} e^{-\lambda} \dfrac{(\lambda s)^k}{k!} = e^{-\lambda + \lambda s}$.

It follows that

(2.10)    $\{p(k; \lambda)\} * \{p(k; \mu)\} = \{p(k; \lambda + \mu)\}$,


which is the same as formula VI(10.5). By differentiation we find
again that both mean and variance of the Poisson distribution equal $\lambda$
[cf. example IX(4.c)].

(d) Geometric and negative binomial distributions. Let X be a random
variable with the geometric distribution


(2.11)    $P\{X = k\} = q^k p$,    k = 0, 1, 2, ...


where p and q are positive constants with p + q = 1. The corresponding
generating function is


(2.12)    $p \sum_{k=0}^{\infty} (qs)^k = \dfrac{p}{1 - qs}$.


Using the results of section 1 we find easily $E(X) = q/p$ and
$\mathrm{Var}(X) = q/p^2$, in agreement with the findings in example IX(3.c).

In a sequence of Bernoulli trials the probability that the first success
occurs after exactly k failures (i.e., at the (k+1)st trial) is $q^k p$, and so X
may be interpreted as the waiting time for the first success. Strictly
speaking, such an interpretation refers to an infinite sample space, and
the advantage of the formal definition (2.11) and the terminology of
random variables is that we need not worry about the structure of the
original sample space. The same is true of the waiting time for the rth
success. If $X_k$ denotes the number of failures following the (k-1)st
and preceding the kth success, then $S_r = X_1 + X_2 + \dots + X_r$ is the
total number of failures preceding the rth success (and $S_r + r$ is the
number of trials up to and including the rth success). The notion of
Bernoulli trials requires that the $X_k$ should be mutually independent
with the same distribution (2.11), and we can define the $X_k$ by this
property. Then $S_r$ has the generating function


(2.13)    $\left( \dfrac{p}{1 - qs} \right)^r$


and the binomial expansion II(8.7) shows at once that the coefficient
of $s^k$ equals


(2.14)    $f(k; r, p) = \binom{-r}{k} p^r (-q)^k$,    k = 0, 1, 2, ....


It follows that $P\{S_r = k\} = f(k; r, p)$, in agreement with the formula
for the number of failures preceding the rth success derived in chapter
VI, section 8. We can restate this result by saying that the
distribution $\{f(k; r, p)\}$ is the r-fold convolution of the geometric distribution with
itself, in symbols


(2.15)    $\{f(k; r, p)\} = \{q^k p\}^{r*}$.


So far we have considered r as an integer. It will be recalled from
chapter VI, section 8, that $\{f(k; r, p)\}$ defines the negative binomial
distribution also when r > 0 is not an integer. The generating function is
still defined by (2.13), and we see that for arbitrary r > 0 the mean
and variance of the negative binomial distribution are $rq/p$ and $rq/p^2$ and
that


(2.16)    $\{f(k; r_1, p)\} * \{f(k; r_2, p)\} = \{f(k; r_1 + r_2, p)\}$.
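
The convolution powers in (2.7) and (2.15) can be verified numerically. The following sketch is an illustration under stated assumptions: `convolve` and `conv_power` are hypothetical helpers, the geometric distribution is truncated after `size` terms (which leaves the first `size` terms of each convolution exact), and the negative binomial probabilities are evaluated through the standard identity $\binom{-r}{k}(-1)^k = \binom{r+k-1}{k}$.

```python
from math import comb

def convolve(a, b, size):
    """First `size` terms of the convolution (2.1) of a and b."""
    return [sum(a[j] * b[r - j] for j in range(r + 1)
                if j < len(a) and r - j < len(b))
            for r in range(size)]

def conv_power(a, n, size):
    """First `size` terms of the n-fold convolution of a, built as in (2.5)."""
    out = [1.0] + [0.0] * (size - 1)   # zero-fold convolution: (1, 0, 0, ...)
    for _ in range(n):
        out = convolve(out, a, size)
    return out

p, q, size = 0.3, 0.7, 12

# (2.7): the binomial distribution is the n-fold convolution of b(k; 1, p).
n = 5
binom = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
binom_conv = conv_power([q, p], n, n + 1)

# (2.15): r-fold convolution of the geometric distribution {q^k p}.
r = 3
geom = [q**k * p for k in range(size)]
negbin = [comb(r + k - 1, k) * p**r * q**k for k in range(size)]
negbin_conv = conv_power(geom, r, size)

print(max(abs(x - y) for x, y in zip(binom, binom_conv)))
print(max(abs(x - y) for x, y in zip(negbin, negbin_conv)))
```

Both printed differences should be at the level of floating-point rounding.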

3. APPLICATION TO FIRST PASSAGE AND RECURRENCE
TIMES IN BERNOULLI TRIALS


This section is inserted mainly for illustration. The results will be
derived by different methods (see example XIII(3.b) and problem
XIII.7; chapter XIV, section 5, and problems 11 and 15-17). For the
special case $p = \tfrac12$ the results are contained in chapter III. However,
the following derivation provides an excellent example for the method
of generating functions; and, in addition, it is instructive to compare
the different approaches.

We consider Bernoulli trials with the probability of success p and
put $X_k = 1$ if the kth trial results in success, $X_k = -1$ otherwise.
Then $S_n = X_1 + \dots + X_n$ is the accumulated excess of successes over
failures in n trials. In the more picturesque gambling language $S_n$ is
called Peter's net gain in the first n trials. It is convenient to put
$S_0 = 0$.


(a) First Passages 


Suppose that Peter decides to quit at the first moment when he has
a positive net gain (necessarily of a unit amount). A direct enumeration
of all possibilities reveals that this will happen at trials number
1, 3, 5, 7, ... with probabilities $p$, $qp^2$, $2q^2p^3$, $5q^3p^4$, ... but a general
rule is not discernible. The sum $\sigma$ of these probabilities equals the
probability that Peter's net gain will ever become positive. Not even this
quantity can be obtained by a direct argument, but we shall show that
$\sigma = 1$ if $p \ge q$ and $\sigma = p/q$ if $p \le q$. Waiting for the net gain to
increase to x units amounts to waiting x times in succession for an
increase of a unit amount. The probability that Peter's gain will ever
reach the level of x units therefore equals $\sigma^x$. We proceed to calculate $\sigma$
and the probabilities $\lambda_n^{(x)}$ that it will take exactly n trials until the net
gain reaches the level x for the first time.

In more formal language we seek the probability $\lambda_n$ that $S_1 \le 0$,
$S_2 \le 0$, ..., $S_{n-1} \le 0$, $S_n = 1$. More generally, we shall say that a
first passage through the point x > 0 occurs at the nth trial if


(3.1)    $S_1 < x$, $S_2 < x$, ..., $S_{n-1} < x$, $S_n = x$.


The probability of this event will be denoted by $\lambda_n^{(x)}$ and for brevity we put
$\lambda_n = \lambda_n^{(1)}$. In gambling (3.1) signifies that Peter's net gain reaches the
level x > 0 for the first time at the nth trial. The term first passage
is suggested by applications to diffusion theory.

Suppose now that the first passage through x = 1 occurs at the rth
trial. The later trials produce the cumulative net gains $S'_1 = X_{r+1}$,
$S'_2 = X_{r+1} + X_{r+2}$, ..., which are independent of the first r trials.
A first passage through x = 2 at time n occurs if, and only if, $S'_1 \le 0$,
..., $S'_{n-r-1} \le 0$, $S'_{n-r} = 1$, and the probability of this event is $\lambda_{n-r}$.
In other words, the probability that the first passages through x = 1
and x = 2 occur at trials number r and n > r is $\lambda_r \lambda_{n-r}$. We conclude
that the first passage through x = 2 at time n has probability


(3.2)    $\lambda_n^{(2)} = \lambda_1 \lambda_{n-1} + \lambda_2 \lambda_{n-2} + \dots + \lambda_{n-1} \lambda_1$.


Remembering that $\lambda_0 = 0$, we see that $\{\lambda_n^{(2)}\} = \{\lambda_n\} * \{\lambda_n\}$ is the
convolution of $\{\lambda_n\}$ with itself. Introducing the generating functions


(3.3)    $\lambda(s) = \sum_{n=1}^{\infty} \lambda_n s^n$,    $\lambda^{(x)}(s) = \sum_{n=1}^{\infty} \lambda_n^{(x)} s^n$

we have $\lambda^{(2)}(s) = \lambda^2(s)$ and, repeating the argument by induction,

(3.4)    $\lambda^{(x)}(s) = \lambda^x(s)$.


It follows that our task has been reduced to finding the probabilities
$\lambda_n$ for the first passage through x = 1. If $X_1 = 1$ then this first
passage takes place at the first trial. If $X_1 = -1$ the cumulative net
gains $X_2$, $X_2 + X_3$, ... after the first trial must increase by two units,
and we conclude that

(3.5)    $\lambda_1 = p$,    $\lambda_n = q\lambda_{n-1}^{(2)}$,    n > 1.


This is obviously equivalent to

(3.6)    $\lambda(s) = ps + qs\lambda^2(s)$,


which is a quadratic equation for $\lambda(s)$. Of the two roots one is
unbounded near s = 0, and the unique bounded solution of (3.6) is

(3.7)    $\lambda(s) = \dfrac{1 - \{1 - 4pqs^2\}^{1/2}}{2qs}$.

We have thus found the generating functions (3.4) of all first passage
times. The binomial expansion II(8.7) enables us to write down the
coefficients


(3.8)    $\lambda_{2m-1} = \dfrac{1}{2q} \binom{1/2}{m} (4pq)^m (-1)^{m-1}$,    $\lambda_{2m} = 0$


but we are not interested in explicit expressions; it is more instructive
to extract the relevant information directly from the generating function.

First note that


(3.9)    $\lambda(1) = \dfrac{1 - |p - q|}{2q}$


and so $\lambda(1) = 1$ if $p \ge q$ but $\lambda(1) = p/q$ if $p \le q$. We conclude that
$\sum \lambda_n$ equals 1 or p/q, whichever is smaller; when q is larger than p (a game
unfavorable to Peter), the probability that the sums $S_n$ remain negative
forever equals (q - p)/q.
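
The recursion (3.5), together with the convolution (3.2), already determines all first-passage probabilities, so the explicit coefficients and the total $\lambda(1)$ can be checked numerically. The sketch below is only an illustration: `first_passage` and `gen_binom` are hypothetical helper names, the values p = 0.3 and the cutoff of 400 trials are arbitrary choices, and the sum of the truncated sequence merely approximates $\lambda(1)$.

```python
def first_passage(p, n_max):
    """lam[n] = probability of a first passage through x = 1 at trial n.

    Uses (3.5): lam[1] = p, lam[n] = q * lam2[n-1], where lam2 is the
    convolution (3.2) of lam with itself (first passage through x = 2).
    """
    q = 1.0 - p
    lam = [0.0] * (n_max + 1)
    lam2 = [0.0] * (n_max + 1)
    lam[1] = p
    for n in range(2, n_max + 1):
        # (3.2) at index n-1 needs only lam[1..n-2], which are already known.
        lam2[n - 1] = sum(lam[r] * lam[n - 1 - r] for r in range(1, n - 1))
        lam[n] = q * lam2[n - 1]
    return lam

def gen_binom(a, m):
    """Generalized binomial coefficient: a(a-1)...(a-m+1)/m!."""
    out = 1.0
    for i in range(m):
        out *= (a - i) / (i + 1)
    return out

p, q = 0.3, 0.7
lam = first_passage(p, 400)

enumerated = [p, q * p**2, 2 * q**2 * p**3, 5 * q**3 * p**4]
closed = [gen_binom(0.5, m) * (4 * p * q)**m * (-1)**(m - 1) / (2 * q)
          for m in range(1, 5)]                      # formula (3.8)
recursive = [lam[1], lam[3], lam[5], lam[7]]

total = sum(lam)   # approximates lambda(1), here p/q since p < q
print(recursive, closed, total, p / q)
```

With p < q the total is close to p/q = 3/7, the probability that Peter's gain ever becomes positive.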

In the symmetric case $p = q = \tfrac12$ and $\sum \lambda_n = 1$; in a prolonged
sequence of coin tossings Peter is sure that he will sooner or later realize a
positive gain. The question is: How long will it take? From $\lambda'(1) = \infty$
we conclude that in coin tossing the number of trials preceding the first
passage through 1 has infinite expectation. If Peter hopes to realize a
unit gain by participating in a coin-tossing game and quitting at the
first opportune moment, he should expect that an enormous number
of trials (and, in consequence, an enormous capital) will be required.
Needless to say that the infinite expectation of the first-passage time
is closely connected with the unexpected characteristics of the
fluctuations in coin tossing discussed at great length in chapter III.


Note. We are now in possession of an explicit formula for $\lambda(s)$, but there remains
the task to calculate the first passage probabilities $\lambda_n^{(x)}$ from (3.3) or (3.4). The
standard analytic procedure for that consists in applying complex variable methods.
It is therefore interesting to remark that simple applications of the reflection
principle enabled us in theorem 2 of chapter III, section 4, to write down an explicit
expression for $\lambda_n^{(x)}$ at least in the symmetric case $p = q = \tfrac12$. (With the notations
used in chapter III we have f§ = \$?_,.) A glance at (3.4) and (3.7) reveals the 
pleasing feature that for arbitrary p the probability $\lambda_n^{(x)}$ equals the corresponding
probability in the symmetric case multiplied by $(4pq)^{n/2}(p/q)^{x/2}$. It is instructive
to follow this case in detail and realize that a most elementary combinatorial
argument enabled us to solve a difficult technical problem and that it replaces a formidable
analytical apparatus.


(b) Recurrence Times 


We shall say that a first return to zero occurs at the nth trial if $S_1 \ne 0$,
$S_2 \ne 0$, ..., $S_{n-1} \ne 0$, $S_n = 0$ (i.e., if the first equalization of the
accumulated numbers of successes and failures occurs). Let $f_n$ be
the probability of this event. (Clearly $f_{2n+1} = 0$ for all n. The first
few $f_{2n}$ are easily found by direct enumeration: $f_2 = 2pq$, $f_4 = 2p^2q^2$,
$f_6 = 4p^3q^3$, $f_8 = 10p^4q^4$.)

Let $\lambda_n^{(-1)}$ be the probability of a first passage through x = -1 at
the nth trial; in other words, $\lambda_n^{(-1)}$ is the quantity obtained from
$\lambda_n^{(1)} = \lambda_n$ by interchanging p and q. As above we note that a return
to zero in n trials is equivalent to a first passage through either +1
or -1 in the n - 1 trials following the first trial, and we conclude

(3.10)    $f_n = q\lambda_{n-1} + p\lambda_{n-1}^{(-1)}$.


Multiply by $s^n$ and add. Observing that the generating functions of
$\{\lambda_n^{(-1)}\}$ and $\{\lambda_n\}$ are obtained from each other by interchanging p
and q, we get


(3.11)    $F(s) = \sum f_n s^n = 1 - (1 - 4pqs^2)^{1/2}$.


We conclude: The probability $\sum f_n$ that the accumulated numbers of
successes and failures will ever equalize is $1 - |p - q|$.

In the special case $p = q = \tfrac12$ we find that $\sum f_n = 1$ but the
probability distribution $\{f_n\}$ has infinite expectation. The probabilities $f_{2n}$
were calculated, by entirely different methods, in chapter III, section 4.
It is illuminating to note that several theorems of chapter III can be
obtained without calculation and without explicit expressions for $f_{2n}$
directly from the generating function F(s). (See problems 6-10.)
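
Relation (3.10) lends itself to a direct numerical check of the enumerated values $f_2, f_4, f_6, f_8$ and of the total probability $1 - |p - q|$. The sketch below is an illustration only: `first_passage` and `first_return` are hypothetical helpers, the first-passage probabilities are generated by the recursion (3.5), and the choice p = 0.3 with a cutoff of 400 trials is arbitrary.

```python
def first_passage(p, n_max):
    """First-passage probabilities through +1, from the recursion (3.5)."""
    q = 1.0 - p
    lam = [0.0] * (n_max + 1)
    lam[1] = p
    for n in range(2, n_max + 1):
        # Convolution (3.2) at index n-1, then lam[n] = q * (that value).
        lam[n] = q * sum(lam[r] * lam[n - 1 - r] for r in range(1, n - 1))
    return lam

def first_return(p, n_max):
    """f[n] = probability of a first return to zero at trial n, by (3.10)."""
    q = 1.0 - p
    lam_plus = first_passage(p, n_max)    # first passage through +1
    lam_minus = first_passage(q, n_max)   # through -1: interchange p and q
    f = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        f[n] = q * lam_plus[n - 1] + p * lam_minus[n - 1]
    return f

p, q = 0.3, 0.7
f = first_return(p, 400)

enumerated = [2 * p * q, 2 * (p * q)**2, 4 * (p * q)**3, 10 * (p * q)**4]
computed = [f[2], f[4], f[6], f[8]]
total = sum(f)   # approximates F(1) = 1 - |p - q|

print(computed, enumerated, total)
```

For p = 0.3 the total is close to 0.6, so with probability about 0.4 the numbers of successes and failures never equalize.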


Note. Conceptually, the problem of this section is analogous to the waiting
time problem of example (2.d). In the sample space of infinite sequences of Bernoulli
trials we may consider the random variable $N_r$ defined as the number of trials from
the first passage through r - 1 up to and including the first passage through r.
The $\{N_r\}$ are mutually independent variables with the common generating function
$\lambda(s)$. The sum $N^{(x)} = N_1 + \dots + N_x$ is the waiting time for the first passage
through x and has the generating function $\lambda^x(s)$. We have formally avoided
referring to infinite sample spaces by defining the random variables in terms of their
distributions. From an analytic point of view the theory is rigorous and
self-contained, but for the probabilistic interpretation and for the intuition it is
preferable to keep the natural infinite sample space in mind.


4. PARTIAL FRACTION EXPANSIONS


Given a generating function $P(s) = \sum p_k s^k$ the coefficients $p_k$ can be
found by differentiations from the obvious formula $p_k = P^{(k)}(0)/k!$.
In practice it may be impossible to obtain explicit expressions and,
anyhow, such expressions are frequently so complicated that reason- 
able approximations are preferable. The most common method for 
obtaining such approximations is based on partial fraction expansions. 
It is known from the theory of complex variables that a large class of 
functions admits of such expansions, but we shall limit our exposition 
to the simple case of rational functions. 

Suppose then that the generating function is of the form


(4.1)    $P(s) = \dfrac{U(s)}{V(s)}$

where U(s) and V(s) are polynomials without common roots. For
simplicity let us first assume that the degree of U(s) is lower than the
degree of V(s), say m. Moreover, suppose that the equation V(s) = 0
has m distinct (real or imaginary) roots $s_1, s_2, \dots, s_m$. Then


(4.2)    $V(s) = (s - s_1)(s - s_2) \cdots (s - s_m)$,

and it is known from algebra that P(s) can be decomposed into partial
fractions

(4.3)    $P(s) = \dfrac{\rho_1}{s_1 - s} + \dfrac{\rho_2}{s_2 - s} + \dots + \dfrac{\rho_m}{s_m - s}$


where $\rho_1, \rho_2, \dots, \rho_m$ are constants. To find $\rho_1$ multiply (4.3) by
$s_1 - s$; as $s \to s_1$ the product $(s_1 - s)P(s)$ tends to $\rho_1$. On the other
hand, from (4.1) and (4.2) we get


(4.4)    $(s_1 - s)P(s) = \dfrac{-U(s)}{(s - s_2)(s - s_3) \cdots (s - s_m)}$.


As $s \to s_1$ the numerator tends to $-U(s_1)$ and the denominator to
$(s_1 - s_2)(s_1 - s_3) \cdots (s_1 - s_m)$, which is the same as $V'(s_1)$. Thus
$\rho_1 = -U(s_1)/V'(s_1)$. The same argument applies to all roots, so that
for $k \le m$

(4.5)    $\rho_k = \dfrac{-U(s_k)}{V'(s_k)}$.


Unfortunately, extensive numerical calculation is usually required
to put (4.1) into the form (4.3). However, once the expansion (4.3)
is obtained, we can easily derive an exact expression for the coefficient
of $s^n$ in P(s). Write


(4.6)    $\dfrac{1}{s_k - s} = \dfrac{1}{s_k} \cdot \dfrac{1}{1 - s/s_k}$.


For $|s| < |s_k|$ we expand the last fraction into a geometric series


(4.7)    $\dfrac{1}{1 - s/s_k} = 1 + \dfrac{s}{s_k} + \left(\dfrac{s}{s_k}\right)^2 + \left(\dfrac{s}{s_k}\right)^3 + \dots$.

Introducing these expressions into (4.3), we find for the coefficient $p_n$
of $s^n$


(4.8)    $p_n = \dfrac{\rho_1}{s_1^{n+1}} + \dfrac{\rho_2}{s_2^{n+1}} + \dots + \dfrac{\rho_m}{s_m^{n+1}}$.

Thus, to get $p_n$ we have first to find the roots $s_1, \dots, s_m$ of the
denominator and then to determine the coefficients $\rho_1, \dots, \rho_m$ from
(4.5).

In (4.8) we have an exact expression for the probability $p_n$. The
labor involved in calculating all m roots is usually prohibitive, and
therefore formula (4.8) is primarily of theoretical interest. Fortunately
a single term in (4.8) almost always provides a satisfactory
approximation. In fact, suppose that $s_1$ is a root which is smaller in absolute
value than all other roots. Then the first denominator in (4.8) is
smallest. Clearly, as n increases, the proportionate contributions of
the other terms decrease and the first term preponderates. In other
words, if $s_1$ is a root of V(s) = 0 which is smaller in absolute value than
all other roots, then, as $n \to \infty$,


(4.9)    $p_n \sim \dfrac{\rho_1}{s_1^{n+1}}$


(the sign ~ indicating that the ratio of the two sides tends to 1).
Usually this formula provides surprisingly good approximations even
for relatively small values of n. The main advantage of (4.9) lies in
the fact that it requires the computation of only one root of an algebraic
equation.

It is easy to remove the restrictions under which we have derived
the asymptotic formula (4.9). To begin with, the degree of the
numerator in (4.1) may exceed the degree m of the denominator. Let U(s)
be of degree m + r ($r \ge 0$); a division reduces P(s) to a polynomial of
degree r plus a fraction $U_1(s)/V(s)$ in which $U_1(s)$ is a polynomial of a
degree lower than m. The polynomial affects only the first r + 1 terms
of the distribution $\{p_n\}$, and $U_1(s)/V(s)$ can be expanded into partial
fractions as explained above. Thus (4.9) remains true. Secondly, the
restriction that V(s) should have only simple roots is unnecessary. It
is known from algebra that every rational function admits of an
expansion into partial fractions. If $s_k$ is a double root of V(s), then the
partial fraction expansion (4.3) will contain an additional term of the
form $a/(s - s_k)^2$, and this will contribute a term of the form
$a(n + 1)s_k^{-(n+2)}$ to the exact expression (4.8) for $p_n$. However, this
does not affect the asymptotic expansion (4.9), provided only that $s_1$
is a simple root. We note this result for future reference as a


Theorem. If P(s) is a rational function with a simple root $s_1$ of the
denominator which is smaller in absolute value than all other roots, then
the coefficient $p_n$ of $s^n$ is given asymptotically by $p_n \sim \rho_1 s_1^{-(n+1)}$, where
$\rho_1$ is defined in (4.5).

A similar asymptotic expansion exists also in the case where $s_1$ is a
multiple root. (See problem 25.)
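
A small numerical experiment may make the theorem concrete. The sketch below takes the rational function $(8 + 4s + 2s^2)/(8 - 4s - 2s^2 - s^3)$ of example (b) of this section, obtains its exact coefficients by power-series division, locates the smallest root of the denominator by bisection, and compares the exact coefficients with the one-term approximation (4.9). The helpers `series_coeffs`, `poly` and `bisect_root` are hypothetical names chosen for this illustration, and bisection on the interval [1, 2] is just one convenient way to find the root, not the method of the text.

```python
def series_coeffs(u, v, n_terms):
    """Coefficients c_n of U(s)/V(s), from V(s) * sum(c_n s^n) = U(s)."""
    c = []
    for n in range(n_terms):
        acc = u[n] if n < len(u) else 0.0
        acc -= sum(v[j] * c[n - j] for j in range(1, min(n, len(v) - 1) + 1))
        c.append(acc / v[0])
    return c

def poly(coeffs, s):
    """Evaluate a polynomial given by ascending coefficients."""
    return sum(a * s**k for k, a in enumerate(coeffs))

def bisect_root(f, lo, hi, iters=200):
    """Root of f in [lo, hi], assuming f changes sign on the interval."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

u = [8.0, 4.0, 2.0]                         # U(s) = 8 + 4s + 2s^2
v = [8.0, -4.0, -2.0, -1.0]                 # V(s) = 8 - 4s - 2s^2 - s^3
dv = [k * a for k, a in enumerate(v)][1:]   # V'(s)

s1 = bisect_root(lambda s: poly(v, s), 1.0, 2.0)
rho1 = -poly(u, s1) / poly(dv, s1)          # formula (4.5)

exact = series_coeffs(u, v, 13)
approx = [rho1 / s1**(n + 1) for n in range(13)]   # formula (4.9)

for n in (3, 4, 12):
    print(n, exact[n], approx[n])
```

The printed pairs show the approximation improving with n, as the theorem predicts.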


Examples. (a) Let $a_n$ be the probability that n Bernoulli trials
result in an even number of successes. This event occurs if an initial
failure at the first trial is followed by an even number of successes or
if an initial success is followed by an odd number. Therefore


(4.10)    $a_n = qa_{n-1} + p(1 - a_{n-1})$,    $a_0 = 1$.


Multiplying by $s^n$ and adding we get the relation
$A(s) - 1 = qsA(s) + ps(1 - s)^{-1} - psA(s)$ for the generating function A(s). Hence


(4.11)    $2A(s) = \{1 - s\}^{-1} + \{1 - (q - p)s\}^{-1}$,    $2a_n = 1 + (q - p)^n$.


Observe that the last formula is in every way preferable to the obvious
answer $a_n = b(0; n, p) + b(2; n, p) + \dots$.
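
The three descriptions of $a_n$ in example (a), namely the recurrence (4.10), the closed form (4.11) and the sum of binomial terms, can be compared numerically. The following sketch is an illustration only; `even_success_probs` is a hypothetical helper and the values p = 0.3 and n up to 10 are arbitrary.

```python
from math import comb

def even_success_probs(p, n_max):
    """a_0, ..., a_{n_max} from the recurrence (4.10), with a_0 = 1."""
    q = 1.0 - p
    a = [1.0]
    for _ in range(n_max):
        a.append(q * a[-1] + p * (1.0 - a[-1]))
    return a

p, n_max = 0.3, 10
q = 1.0 - p

a = even_success_probs(p, n_max)
closed = [(1 + (q - p)**n) / 2 for n in range(n_max + 1)]   # (4.11)
direct = [sum(comb(n, k) * p**k * q**(n - k) for k in range(0, n + 1, 2))
          for n in range(n_max + 1)]                         # b(0) + b(2) + ...

print(a[:4])
print(closed[:4])
print(direct[:4])
```

All three lists agree up to rounding, while the closed form needs no summation at all.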

(b) Let $q_n$ be the probability that in n tosses of an ideal coin no run
of three consecutive heads appears. (Note that $\{q_n\}$ is not a probability
distribution; if $p_n$ is the probability that the first run of three
consecutive heads ends at the nth trial, then $\{p_n\}$ is a probability distribution,
and $q_n$ represents its "tails," $q_n = p_{n+1} + p_{n+2} + \dots$.)

We can easily show that $q_n$ satisfies the recurrence formula


(4.12)    $q_n = \tfrac12 q_{n-1} + \tfrac14 q_{n-2} + \tfrac18 q_{n-3}$.


In fact, the event that n trials produce no sequence HHH can occur
only when the trials begin with T, HT, or HHT. The probabilities
that the following trials lead to no run HHH are $q_{n-1}$, $q_{n-2}$, and $q_{n-3}$,
respectively, and the right side of (4.12) therefore contains the
probabilities of the three mutually exclusive ways in which the event "no
run HHH" can occur.

Evidently $q_0 = q_1 = q_2 = 1$, and hence the $q_n$ can be calculated
successively from (4.12). To obtain the generating function
$Q(s) = \sum q_n s^n$ we multiply both sides by $s^n$ and add. We get


$Q(s) - 1 - s - s^2 = \dfrac{s}{2}\{Q(s) - 1 - s\} + \dfrac{s^2}{4}\{Q(s) - 1\} + \dfrac{s^3}{8}Q(s)$


or

(4.13)    $Q(s) = \dfrac{2s^2 + 4s + 8}{8 - 4s - 2s^2 - s^3}$.

The denominator has the root $s_1 = 1.0873778\ldots$ and two complex
roots. For $|s| < s_1$ we have $|4s + 2s^2 + s^3| < 4s_1 + 2s_1^2 + s_1^3 = 8$,
and the same inequality holds also when $|s| = s_1$ unless $s = s_1$. Hence
the other two roots exceed $s_1$ in absolute value. Thus, from (4.9)


(4.14)    $q_n \sim \dfrac{1.236840}{(1.0873778)^{n+1}}$


where the numerator equals $(2s_1^2 + 4s_1 + 8)/(4 + 4s_1 + 3s_1^2)$. This
formula gives remarkably good approximations even for small values
of n. It approximates $q_3 = 0.875$ by 0.8847 and $q_4 = 0.8125$ by
0.81360. The percentage error decreases steadily, and $q_{12} = 0.41626\ldots$
is given correct to five decimal places.


5. BIVARIATE GENERATING FUNCTIONS


For a pair of integral-valued random variables X, Y with a joint
distribution of the form


(5.1)    $P\{X = j, Y = k\} = p_{jk}$,    j, k = 0, 1, ...

we define a generating function depending on two variables

(5.2)    $P(s_1, s_2) = \sum_{j,k} p_{jk} s_1^j s_2^k$.


Such generating functions will be called bivariate for short. 

The considerations of the first two sections apply without essential 
modifications, and it will suffice to point out three properties evident 
from (5.2): 

(a) The generating functions of the marginal distributions $P\{X = j\}$
and $P\{Y = k\}$ are A(s) = P(s, 1) and B(s) = P(1, s).

(b) The generating function of X + Y is P(s, s).

(c) The variables X and Y are independent if, and only if,
$P(s_1, s_2) = A(s_1)B(s_2)$ for all $s_1, s_2$.


Examples. (a) Bivariate Poisson distribution. It is obvious that

(5.3)    $P(s_1, s_2) = e^{-a_1 - a_2 - b + a_1 s_1 + a_2 s_2 + b s_1 s_2}$,    $a_i > 0$, $b > 0$


has a power-series expansion with positive coefficients adding up to
unity. Accordingly $P(s_1, s_2)$ represents the generating function of a
bivariate probability distribution. The marginal distributions are
Poisson distributions with mean $a_1 + b$ and $a_2 + b$, respectively, but
the sum X + Y has the generating function $e^{-a_1 - a_2 - b + (a_1 + a_2)s + bs^2}$ and
is not a Poisson variable. (It is a compound Poisson distribution; see
chapter XII, section 2.)

(b) Multinomial distributions. Consider a sequence of n independent
trials, each of which results in $E_0$, $E_1$, or $E_2$ with respective probabilities
$p_0$, $p_1$, $p_2$. If $X_i$ is the number of occurrences of $E_i$, then $(X_1, X_2)$ has
a trinomial distribution with generating function $(p_0 + p_1 s_1 + p_2 s_2)^n$.
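
Properties (a) to (c) of bivariate generating functions can be illustrated with the trinomial distribution of example (b). In the sketch below, which is only an illustration with arbitrarily chosen parameters, the joint probabilities are computed from the multinomial formula; summing over one index corresponds to setting the other variable equal to 1, and summing along diagonals corresponds to P(s, s). The helper names `trinomial` and `binomial` are hypothetical.

```python
from math import comb

def trinomial(n, p1, p2):
    """joint[j][k] = P{X1 = j, X2 = k}, the coefficient of s1^j s2^k
    in (p0 + p1 s1 + p2 s2)^n with p0 = 1 - p1 - p2."""
    p0 = 1.0 - p1 - p2
    return [[comb(n, j) * comb(n - j, k) * p1**j * p2**k * p0**(n - j - k)
             if j + k <= n else 0.0
             for k in range(n + 1)] for j in range(n + 1)]

def binomial(n, p):
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

n, p1, p2 = 4, 0.2, 0.5
joint = trinomial(n, p1, p2)

# (a) Coefficients of P(s, 1): marginal distribution of X1.
marginal_x1 = [sum(row) for row in joint]
# (b) Coefficients of P(s, s): distribution of X1 + X2.
sum_dist = [sum(joint[j][r - j] for j in range(r + 1)) for r in range(n + 1)]
# (c) X1 and X2 are dependent: X1 = X2 = n is impossible.
marginal_x2 = [sum(joint[j][k] for j in range(n + 1)) for k in range(n + 1)]
product_at_corner = marginal_x1[n] * marginal_x2[n]

print(marginal_x1, binomial(n, p1))
print(sum_dist, binomial(n, p1 + p2))
print(joint[n][n], product_at_corner)
```

The marginal of $X_1$ is binomial with parameter $p_1$, the sum is binomial with parameter $p_1 + p_2$, and the joint distribution does not factor into its marginals.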

*6. THE CONTINUITY THEOREM


We know from chapter VI that the Poisson distribution $\{e^{-\lambda}\lambda^k/k!\}$
is the limiting form of the binomial distribution with the probability
p depending on n in such a way that $np \to \lambda$ as $n \to \infty$. Then
$b(k; n, p) \to e^{-\lambda}\lambda^k/k!$. The generating function of $\{b(k; n, p)\}$ is
$(q + ps)^n = \{1 - \lambda(1 - s)/n\}^n$. Taking logarithms, we see directly
that this generating function tends to $e^{-\lambda(1 - s)}$, which is the generating
function of the Poisson distribution. We shall show that this situation
prevails in general; a sequence of probability distributions converges 
to a limiting distribution if and only if the corresponding generating 
functions converge. Unfortunately, this theorem is of limited applica- 
bility, since the most interesting limiting forms of discrete distributions 
are continuous distributions (for example, the normal distribution ap- 
pears as a limiting form of the binomial distribution). 


Continuity Theorem. Suppose that for every fixed n the sequence
$a_{0,n}, a_{1,n}, a_{2,n}, \ldots$ is a probability distribution, that is,

(6.1) $a_{k,n} \ge 0, \qquad \sum_{k=0}^{\infty} a_{k,n} = 1.$

In order that for every fixed k

(6.2) $a_{k,n} \to a_k$

as $n \to \infty$, it is necessary and sufficient that for every s with
$0 < s < 1$

(6.3) $A_n(s) \to A(s).$

Here

(6.4) $A_n(s) = \sum_{k=0}^{\infty} a_{k,n} s^k, \qquad A(s) = \sum_{k=0}^{\infty} a_k s^k$

denote the corresponding generating functions.


Note. If (6.2) holds, then automatically $0 \le a_k \le 1$ and
$\sum a_k \le 1$. The generating function A(s) exists therefore at least
for |s| < 1. However, the limiting sequence $\{a_k\}$ is not necessarily a
probability distribution; for example, if the first n terms of the
distribution $\{a_{k,n}\}$ vanish, then the limiting sequence vanishes
identically. For $\{a_k\}$ to be a probability distribution it is
necessary and sufficient that $\sum a_k = 1$ or A(1) = 1.


* The contents of this section will not be used in the sequel. 
Proof.⁷ First, suppose that (6.2) holds. For fixed s (0 < s < 1) and
fixed $\varepsilon$ we can choose r so that $s^r/(1 - s) < \varepsilon$. Then

(6.5) $|A_n(s) - A(s)| \le \sum_{k=0}^{r} |a_{k,n} - a_k|\, s^k + 2\varepsilon.$

The sum on the right contains only finitely many terms, each of which
tends to zero. Hence $|A_n(s) - A(s)|$ is arbitrarily small for n
sufficiently large. Next, assume that (6.3) holds. We use the well-known
fact⁸ that it is always possible to find a subsequence $\{a_{k,n_\nu}\}$
of the given sequence of distributions which converges. If (6.2) were
not true, then it would be possible to extract two subsequences
converging to two different limiting sequences $\{a_k^*\}$ and
$\{a_k^{**}\}$, and the corresponding subsequences of $\{A_n(s)\}$ would
converge to $A^*(s) = \sum a_k^* s^k$ and $A^{**}(s) = \sum a_k^{**} s^k$,
respectively. However, this is impossible in view of the assumption
(6.3). Therefore (6.3) implies (6.2).


Examples. (a) The negative binomial distribution. We saw in example
(2.d) that the generating function of the distribution $\{f(k; r, p)\}$
is $p^r(1 - qs)^{-r}$. Now let $\lambda$ be fixed, and let $p \to 1$,
$q \to 0$, and $r \to \infty$, so that $q = \lambda/r$. Then

(6.6) $\left(\frac{p}{1 - qs}\right)^r = \left(\frac{1 - \lambda/r}{1 - \lambda s/r}\right)^r.$

Passing to logarithms, we see that the right side tends to
$e^{-\lambda + \lambda s}$, which is the generating function of the
Poisson distribution $\{e^{-\lambda}\lambda^k/k!\}$. Hence if
$r \to \infty$ and $rq \to \lambda$, then

(6.7) $f(k; r, p) \to e^{-\lambda}\, \frac{\lambda^k}{k!}.$


(b) Bernoulli trials with variable probabilities. Consider n independent
trials such that the kth trial results in success with probability $p_k$
and in failure with probability $q_k = 1 - p_k$. The number $S_n$ of
successes can be written as the sum $S_n = X_1 + \ldots + X_n$ of n mutually
independent random variables $X_k$ with the distributions $P\{X_k = 0\} = q_k$,


7 The theorem is a special case of the continuity theorem for Laplace-Stieltjes
transforms, and the proof follows the general pattern. In the literature the conti- 
nuity theorem for generating functions is usually stated and proved under unneces- 
sary restrictions. 

8 This is easily established by the "method of diagonals" due to G. Cantor and
found in all books on set theory. The statement is, incidentally, a special case of 
a well-known theorem of Helly. 
$P\{X_k = 1\} = p_k$. The generating function of $X_k$ is $q_k + p_k s$, and
hence the generating function of $S_n$ is

(6.8) $P(s) = (q_1 + p_1 s)(q_2 + p_2 s) \cdots (q_n + p_n s).$


As an application of this scheme let us assume that each house in a
city has a small probability $p_k$ of burning on a given day. The sum
$p_1 + \ldots + p_n$ is the expected number of fires in the city, n being the
number of houses. We have seen in chapter VI that if all $p_k$ are equal
and if the houses are stochastically independent, then the number of
fires is a random variable whose distribution is near the Poisson
distribution. We show now that this conclusion remains valid also under
the more realistic assumption that the probabilities $p_k$ are not equal.
This result should increase our confidence in the Poisson distribution
as an adequate description of phenomena which are the cumulative
effect of many improbable events ("successes"). Accidents and
telephone calls are typical examples.

We use the now familiar model of an increasing number n of variables
where the probabilities $p_k$ depend on n in such a way that the largest
$p_k$ tends to zero, but the sum $p_1 + p_2 + \ldots + p_n = \lambda$
remains constant. Then from (6.8)

(6.9) $\log P(s) = \sum_{k=1}^{n} \log\{1 - p_k(1 - s)\}.$

Since $p_k \to 0$, we can use the fact that
$\log(1 - x) = -x - \theta x$, where $\theta \to 0$ as $x \to 0$. It
follows that

(6.10) $\log P(s) = -(1 - s)\left\{\sum_{k=1}^{n} (p_k + \theta_k p_k)\right\} \to -\lambda(1 - s),$

so that P(s) tends to the generating function of the Poisson
distribution. Hence, $S_n$ has in the limit a Poisson distribution. We
conclude that for large n and moderate values of
$\lambda = p_1 + p_2 + \ldots + p_n$ the distribution of $S_n$ can be
approximated by a Poisson distribution. [Cf. example IX(5.b).]
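
The approximation in example (b) can be checked numerically. The following is a minimal sketch, not part of the original text: it multiplies out the product (6.8) to get the exact distribution of $S_n$ and compares it with the Poisson distribution of the same mean. The particular choice of unequal probabilities (with $p_k$ proportional to k) and the value $\lambda = 2$ are assumptions made only for illustration.

```python
import math

def poisson_binomial_pmf(probs):
    """Coefficients of P(s) = prod_k (q_k + p_k s), i.e. the distribution of S_n."""
    coeffs = [1.0]
    for p in probs:
        q = 1.0 - p
        nxt = [0.0] * (len(coeffs) + 1)
        for j, c in enumerate(coeffs):
            nxt[j] += q * c        # this trial fails
            nxt[j + 1] += p * c    # this trial succeeds
        coeffs = nxt
    return coeffs

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def max_abs_error(n, lam=2.0):
    # Hypothetical unequal probabilities: p_k proportional to k, with sum equal to lam.
    total = n * (n + 1) / 2
    probs = [lam * k / total for k in range(1, n + 1)]
    pmf = poisson_binomial_pmf(probs)
    return max(abs(pmf[k] - poisson_pmf(lam, k)) for k in range(min(n, 20) + 1))

for n in (10, 100, 1000):
    print(n, max_abs_error(n))
```

As n grows with the sum of the $p_k$ held fixed, the largest $p_k$ tends to zero and the printed discrepancy should shrink, in line with (6.10).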


7. PROBLEMS FOR SOLUTION 


1. Let X be a random variable with generating function P(s). Find the 
generating functions of X + 1 and 2X. 

2. Continuation. Find the generating functions of (a) $P\{X \le n\}$,
(b) $P\{X < n\}$, (c) $P\{X \ge n\}$, (d) $P\{X > n + 1\}$, (e) $P\{X = 2n\}$.

3. In a sequence of Bernoulli trials let $u_n$ be the probability that the
first combination SF occurs at trials number n - 1 and n. Find the
generating function, mean, and variance.
4. Discuss which of the formulas of chapter II, section 12, represent
convolutions and where generating functions have been used.


5. Let $a_n$ be the number of ways in which the score n can be obtained by
throwing a die any number of times. Show that the generating function of
$\{a_n\}$ is $\{1 - s - s^2 - s^3 - s^4 - s^5 - s^6\}^{-1}$.


Note: Problems 6-10 refer to coin tossing with the usual notations. They contain,
among other things, a straightforward derivation of certain relations found in chapter
III. We write $u_n = P\{S_n = 0\}$ and
$f_n = P\{S_1 \ne 0, S_2 \ne 0, \ldots, S_{n-1} \ne 0, S_n = 0\}$ (first return);
by definition $u_0 = 1$, $f_0 = 0$. We assume known (from section 3) that $\{f_n\}$
has the generating function $F(s) = 1 - \{1 - s^2\}^{1/2}$, and nothing more.
The calculations are practically nil, and no explicit formulas for the coefficients are
required.


6. The generating function of $\{u_n\}$ is $U(s) = \{1 - s^2\}^{-1/2}$.

7. The probability that no zero occurs up to time 2n is the same as the
probability $u_{2n}$ that $S_{2n} = 0$.

8. The probability that $S_{2n} = 0$ and that all the sums
$S_1, S_2, \ldots, S_{2n}$ are $\ge 0$ equals $2f_{2n+2}$.

9. The probability that the first change of sign occurs following the 2nth
trial equals $2f_{2n+2}$.

10. The probability that exactly k among the sums $S_1, \ldots, S_n$ are zero
has the generating function $F^k(s)\, U(s)(1 + s)$.


11. In a sequence of Bernoulli trials with p > q let $a_n$ be the probability
that there exists an index j > n such that $S_j = 0$. Show that $a_n$ has the
generating function $4pq\,[p - q + (1 - 4pqs^2)^{1/2}]^{-1}(1 + s)$.

12. In the waiting time example IX(3.d) find the generating function of $S_r$
(for r fixed). Verify formula IX(3.3) for the mean and calculate the variance.

13. Continuation. The following is an alternative method for deriving the
same result. Let $p_n(r) = P\{S_r = n\}$. Prove the recursion formula

(7.1) $p_{n+1}(r) = \frac{r - 1}{N}\, p_n(r) + \frac{N - r + 1}{N}\, p_n(r - 1).$

Derive the generating function directly from (7.1).


14. Solve the two preceding problems for r preassigned elements (instead 
of r arbitrary ones). 


15.4 Let the sequence of Bernoulli trials up to the first failure be called a 


* Problems 15-17 have a direct bearing on the game of billiards. The probability
p of success is a measure of the player's skill. The player continues to play until
he fails. Hence the number of successes he accumulates is the length of his "turn."
The game continues until one player has scored N successes. Problem 15 therefore
gives the probability distribution of the number of turns one player needs to score
k successes, problem 16 the average duration, and problem 17 the probability of a
tie between two players. For further details cf. O. Bottema and S. C. Van Veen,
Kansberekningen bij het biljartspel, Nieuw Archief voor Wiskunde (in Dutch), vol.
22 (1943), pp. 16-33 and 123-158.

turn. Find the generating function and the probability distribution of the
accumulated number $S_r$ of successes in r turns.

16. Continuation. Let R be the number of successive turns up to the
$\nu$th success (that is, the $\nu$th success occurs during the Rth turn).
Prove that

$P\{R = r\} = p^{\nu} q^{r-1} \binom{r + \nu - 2}{\nu - 1}.$

Find E(R) and Var(R).


17. Continuation. Consider two sequences of Bernoulli trials with
probabilities $p_1, q_1$, and $p_2, q_2$, respectively. Show that the
probability that the same number of turns will lead to the Nth success can
be exhibited in either of the forms:

$(p_1 p_2)^N \sum_{\nu=1}^{\infty} \binom{N + \nu - 2}{\nu - 1}^2 (q_1 q_2)^{\nu - 1} = (p_1 p_2)^N (1 - q_1 q_2)^{1 - 2N} \sum_{\nu=0}^{N-1} \binom{N - 1}{\nu}^2 (q_1 q_2)^{\nu}.$


18. Let $\{X_k\}$ be mutually independent variables, each assuming the values
0, 1, 2, ..., a-1 with probabilities 1/a. Let $S_n = X_1 + \ldots + X_n$. Show
that the generating function of $S_n$ is

$P(s) = \left\{\frac{1 - s^a}{a(1 - s)}\right\}^n$

and hence

$P\{S_n = j\} = \frac{1}{a^n} \sum_{\nu=0}^{\infty} (-1)^{\nu + j + a\nu} \binom{n}{\nu} \binom{-n}{j - a\nu}.$

(Only finitely many terms in the sum are different from zero.)

Note: For a = 6 we get the probability of scoring the sum j + n in a throw
with n dice. The solution goes back to DeMoivre.

19. Continuation. The probability $P\{S_n \le j\}$ has the generating function
$P(s)/(1 - s)$ and hence

$P\{S_n \le j\} = \frac{1}{a^n} \sum_{\nu} (-1)^{\nu + j + a\nu} \binom{n}{\nu} \binom{-n - 1}{j - a\nu}.$


20. Continuation: the limiting form. If $a \to \infty$ and $j \to \infty$, so
that $j/a \to x$, then

$P\{S_n \le j\} \to \frac{1}{n!} \sum_{\nu} (-1)^{\nu} \binom{n}{\nu} (x - \nu)^n,$

the summation extending over all $\nu$ with $0 \le \nu < x$.

Note: This result is due to Lagrange. In the theory of geometric probabilities
the right-hand side represents the distribution function of the sum of n
independent random variables with "uniform" distribution in the interval (0, 1).

21. Let $u_n$ be the probability that the number of successes in n Bernoulli
trials is divisible by 3. Find a recursive relation for $u_n$ and hence the
generating function.

22. Continuation: alternative method. Let $v_n$ and $w_n$ be the probabilities
that $S_n$ is of the form $3\nu + 1$ and $3\nu + 2$, respectively (so that
$u_n + v_n + w_n = 1$). Find three simultaneous recursive relations and hence
three equations for the generating functions.

23. Let X and Y be independent variables with generating functions U(s)
and V(s). Show that $P\{X - Y = j\}$ is the coefficient of $s^j$ in
$U(s) V(1/s)$, where $j = 0, \pm 1, \pm 2, \ldots$.

24. Moment generating functions. Let X be a random variable with generating
function P(s), and suppose that $\sum p_n s^n$ converges for some $s_0 > 1$.
Then all moments $m_r = E(X^r)$ exist, and the generating function F(s) of the
sequence $m_r/r!$ converges at least for $|s| < \log s_0$. Moreover

$F(s) = \sum_{r=0}^{\infty} \frac{m_r}{r!}\, s^r = P(e^s).$

Note: F(s) is usually called the moment generating function, although in
reality it generates $m_r/r!$.

25. Suppose that $A(s) = \sum a_n s^n$ is a rational function U(s)/V(s) and
that $s_1$ is a root of V(s), which is smaller in absolute value than all
other roots. If $s_1$ is of multiplicity r, show that

$a_n \sim \frac{\rho_1}{s_1^{\,n+r}} \binom{n + r - 1}{r - 1},$

where $\rho_1 = (-1)^r\, r!\, U(s_1)/V^{(r)}(s_1)$.

26. Bivariate negative binomial distributions. Show that for positive values
of the parameters $p_0^a \{1 - p_1 s_1 - p_2 s_2\}^{-a}$ is the generating
function of the distribution of a pair (X, Y) such that the marginal
distributions of X, Y, and X + Y are negative binomial distributions.⁵


5 Distributions of this type were used by G. E. Bates and J. Neyman in
investigations of accident proneness. See University of California
Publications in Statistics, vol. 1, 1952.


CHAPTER XII* 


Compound Distributions. 
Branching Processes 




1. SUMS OF A RANDOM NUMBER OF VARIABLES 


Let $\{X_k\}$ be a sequence of mutually independent random variables
with the common distribution $P\{X_k = j\} = f_j$ and generating function
$f(s) = \sum f_j s^j$. We are often interested in sums
$S_N = X_1 + X_2 + \ldots + X_N$, where the number N of terms is a random
variable independent of the $X_k$. Let $P\{N = n\} = g_n$ be the
distribution of N and $g(s) = \sum g_n s^n$ its generating function. For
the distribution $\{h_j\}$ of $S_N$ we get from the fundamental formula
for conditional probabilities

(1.1) $h_j = P\{S_N = j\} = \sum_{n=0}^{\infty} g_n\, P\{X_1 + \ldots + X_n = j\}.$


If N assumes only finitely many values, the random variable $S_N$ is
defined on the sample space of finitely many $X_k$. Otherwise the
probabilistic definition of $S_N$ as a sum involves the sample space of an
infinite sequence $\{X_k\}$, but we shall be dealing only with the
distribution function of $S_N$: for our purposes we take the distribution
(1.1) as definition of the variable $S_N$ on the sample space with points
0, 1, 2, ....

For a fixed n the distribution of $X_1 + X_2 + \ldots + X_n$ is given by the
n-fold convolution of $\{f_j\}$ with itself, and therefore (1.1) can be
written in the compact form

(1.2) $\{h_j\} = \sum_{n=0}^{\infty} g_n \{f_j\}^{n*}.$

This formula can be simplified by the use of generating functions. The
generating function of $\{f_j\}^{n*}$ is $f^n(s)$ and it is obvious from
(1.2) that


* The contents of this chapter will not be used in the sequel. 
the generating function of the sum $S_N$ is given by

(1.3) $h(s) = \sum_{j=0}^{\infty} h_j s^j = \sum_{n=0}^{\infty} g_n f^n(s).$

The right side is the Taylor expansion of g(s) with s replaced by f(s);
hence it equals g(f(s)). This proves the


Theorem. The generating function of the sum $S_N = X_1 + \ldots + X_N$
is the compound function g(f(s)).


Two special cases are of interest. 

(a) If the $X_k$ are Bernoulli variables with $P\{X_k = 1\} = p$ and
$P\{X_k = 0\} = q$, then f(s) = q + ps and therefore h(s) = g(q + ps).

(b) If N has a Poisson distribution with mean t then

(1.4) $h(s) = e^{-t + t f(s)}.$

The distribution with this generating function will be called the
compound Poisson distribution.

If the $X_k$ are Bernoulli variables and N has a Poisson distribution,
then $h(s) = e^{-tp + tps}$: the sum $S_N$ has a Poisson distribution with
mean tp.
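
The last statement can be verified numerically from the convolution formula (1.2). The sketch below is illustrative only: the helper names `convolve` and `compound_pmf`, the values t = 3 and p = 0.4, and the truncation of the Poisson distribution at 60 terms are all assumptions, not part of the text.

```python
import math

def convolve(a, b, size):
    """First `size` terms of the convolution of the sequences a and b."""
    out = [0.0] * size
    for i, ai in enumerate(a[:size]):
        for j, bj in enumerate(b[:size - i]):
            out[i + j] += ai * bj
    return out

def compound_pmf(g, f, size):
    """h_j = sum_n g_n {f}^{n*}_j as in (1.2), for j < size and n < len(g)."""
    h = [0.0] * size
    power = [1.0] + [0.0] * (size - 1)   # {f}^{0*}: unit mass at 0
    for gn in g:
        for j in range(size):
            h[j] += gn * power[j]
        power = convolve(power, f, size)  # next convolution power of {f}
    return h

t, p = 3.0, 0.4                           # illustrative values
g = [math.exp(-t) * t ** n / math.factorial(n) for n in range(60)]  # Poisson, truncated
f = [1 - p, p]                            # Bernoulli: f(s) = q + ps
h = compound_pmf(g, f, size=10)
target = [math.exp(-t * p) * (t * p) ** j / math.factorial(j) for j in range(10)]
print(max(abs(a - b) for a, b in zip(h, target)))
```

The printed discrepancy between the compound distribution and the Poisson distribution with mean tp should be at the level of rounding error.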


Examples. (a) We saw in example VI(7.c) that X-rays produce
chromosome breakages in cells; for a given dosage and time of exposure
the number N of breakages in individual cells has a Poisson
distribution. Each breakage has a fixed probability q of healing whereas
with probability p = 1 - q the cell dies. Here $S_N$ is the number of
observable breakages¹ and has a Poisson distribution with mean tp.

(b) In animal-trapping experiments² $g_n$ represents the probability
that a species is of size n. If each animal has a fixed probability p of
being trapped, then (assuming stochastic independence) the number
of trapped representatives of one species in the sample is a variable
$S_N$ with generating function g(q + ps). This description can be varied
in many ways. For example, let $g_n$ be the probability of an insect's
laying n eggs, and p the probability of survival of an egg. Then $S_N$ is
the number of surviving eggs. Again, let $g_n$ be the probability of a
family's having n children and let the sex ratio of boys to girls be
p:q. Then $S_N$ represents the number of boys in a family.


1 See D. G. Catcheside, Genetic effects of radiations, Advances in Genetics, edited
by M. Demerec, vol. 2, Academic Press, New York, 1948, pp. 271-358, in particular
p. 339.

2 D. G. Kendall, On some modes of population growth leading to R. A. Fisher's
logarithmic series distribution, Biometrika, vol. 35 (1948), pp. 6-15.

(c) Each plant has a large number of seeds, but each seed has only
a small probability of survival, and it is therefore reasonable to assume
that the number of survivors of an individual plant has a Poisson
distribution. If $\{g_n\}$ represents the distribution of the number of
parent plants, $g(e^{-\lambda + \lambda s})$ is the generating function of
the number of surviving seeds.


2. THE COMPOUND POISSON DISTRIBUTION 
We preface our considerations by two typical 


Examples. (a) Suppose that the number of hits by lightning during
any time interval of duration t is a Poisson variable with mean
$\lambda t$. If $\{f_n\}$ is the probability distribution of the damage
caused by an individual hit by lightning, then (assuming stochastic
independence) the probability distribution of the total damage during
time t is a compound Poisson distribution
$\sum_{n=0}^{\infty} e^{-\lambda t} \frac{(\lambda t)^n}{n!} \{f_j\}^{n*}$
with generating function

(2.1) $h(s; t) = e^{-\lambda t + \lambda t f(s)}.$

(b) In ecology it is assumed that the number of animal litters in a
plot has a Poisson distribution with mean proportional to the area t
of the plot. If $\{f_k\}$ is the distribution of the number of animals in a
litter, then (2.1) is the generating function for the total number of
animals in the plot.


We recall from chapter VI that many phenomena depending on time 
or space obey a Poisson distribution, and the preceding examples will 
explain why the compound Poisson distribution is also frequently con- 
nected with such phenomena. 

The generating function (2.1) has the remarkable property that

(2.2) $h(s; t_1 + t_2) = h(s; t_1)\, h(s; t_2).$

In an intuitive way we may describe this as follows. With each period
of duration t there is associated a random variable with generating
function h(s; t) which we call the contribution of that period. The
contributions of two non-overlapping periods are independent, which
means that a partitioning $t = t_1 + t_2$ of a period into two parts
induces a decomposition $X(t) = X(t_1) + X(t_2)$ of its contribution into
a sum of two independent variables.
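
Property (2.2) says that the distribution for a period $t_1 + t_2$ is the convolution of the distributions for $t_1$ and $t_2$. A minimal numerical check follows; it is a sketch and not the author's method. The coefficients of $e^{-\lambda t + \lambda t f(s)}$ are computed by a recursion obtained from differentiating the generating function, and the rate, the distribution {f_k}, and the periods are hypothetical values chosen for illustration.

```python
import math

def compound_poisson_pmf(rate, f, size):
    """First `size` coefficients of h(s) = exp(-rate + rate * f(s)).

    Uses j*h_j = rate * sum_{i>=1} i*f_i*h_{j-i}, which follows from
    differentiating h(s): h'(s) = rate * f'(s) * h(s).
    """
    h = [math.exp(-rate * (1.0 - f[0]))] + [0.0] * (size - 1)
    for j in range(1, size):
        acc = 0.0
        for i in range(1, min(j, len(f) - 1) + 1):
            acc += i * f[i] * h[j - i]
        h[j] = rate * acc / j
    return h

def convolve(a, b):
    """Convolution of two sequences, truncated to the shorter length."""
    size = min(len(a), len(b))
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(size)]

lam = 1.5                        # illustrative rate per unit of t
f = [0.0, 0.5, 0.3, 0.2]         # hypothetical litter-size distribution
t1, t2 = 0.7, 1.8
lhs = compound_poisson_pmf(lam * (t1 + t2), f, 25)
rhs = convolve(compound_poisson_pmf(lam * t1, f, 25),
               compound_poisson_pmf(lam * t2, f, 25))
print(max(abs(a - b) for a, b in zip(lhs, rhs)))
```

The first 25 coefficients of a product of power series depend only on the first 25 coefficients of each factor, so the truncation does not affect the comparison.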

In the next section it will be shown that (among integral-valued random
variables) only the compound Poisson distribution has this property.
Here we preface the formulation of the theorem by two
Examples. (c) The negative binomial distribution with generating
function

(2.3) $h(s; t) = \left(\frac{p}{1 - qs}\right)^t, \qquad p + q = 1$

does have the property (2.2). Therefore the negative binomial (2.3) is a
compound Poisson distribution; it takes on the form (2.1) with

(2.4) $\lambda = \log\frac{1}{p}, \qquad f(s) = \frac{1}{\lambda}\log\frac{1}{1 - qs}, \qquad f_n = \frac{q^n}{\lambda n}.$

The distribution $\{q^n/(\lambda n)\}$ is called the logarithmic distribution.

(d) Multiple Poisson distributions. Suppose that we classify
automobile accidents according to the number of vehicles involved as
singlets, doublets, etc. Suppose further that the numbers of singlets,
doublets, etc., have Poisson distributions with means
$\lambda_1 t, \lambda_2 t, \ldots$ and that there is no stochastic
dependence among them. The total number of vehicles involved in
accidents during a period t has then the generating function

(2.5) $e^{-\lambda_1 t(1 - s)}\, e^{-\lambda_2 t(1 - s^2)}\, e^{-\lambda_3 t(1 - s^3)} \cdots.$

This is again a compound Poisson distribution with
$\lambda = \sum \lambda_i$ and $f_i = \lambda_i/\lambda$. Conversely,
every compound Poisson distribution can be rewritten in the form (2.5)
and therefore admits of the alternative interpretation as representing
the cumulative effect of singlets, doublets, etc.
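
The identification in example (c) can be checked by evaluating both generating functions at a few points. The sketch below assumes illustrative values p = 0.35 and t = 2.5 and truncates the logarithmic series at 2000 terms; the function name `log_series_gf` is hypothetical.

```python
import math

def log_series_gf(q, s, terms=2000):
    """f(s) = (1/lambda) * sum_{n>=1} (q s)^n / n, with lambda = log(1/(1-q))."""
    lam = math.log(1.0 / (1.0 - q))
    return sum((q * s) ** n / n for n in range(1, terms + 1)) / lam

p, t = 0.35, 2.5                 # illustrative parameters
q = 1 - p
lam = math.log(1 / p)
for s in (0.0, 0.3, 0.9, 1.0):
    compound = math.exp(-lam * t + lam * t * log_series_gf(q, s))  # form (2.1)
    negbin = (p / (1 - q * s)) ** t                                # form (2.3)
    print(s, compound, negbin)
```

The two printed columns should agree to rounding error, and `log_series_gf(q, 1.0)` should be close to 1, confirming that the logarithmic terms form a probability distribution.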


3. INFINITELY DIVISIBLE DISTRIBUTIONS 


A probability distribution $\{h_i\}$, i = 0, 1, ..., is called infinitely
divisible, if for each n it can be represented as the n-fold convolution
of a probability distribution $\{\varphi_i\}$ with itself, that is, if its
generating function h(s) has an nth root such that
$h^{1/n}(s) = \varphi(s)$ generates a probability distribution
$\{\varphi_i\}$.

Note that if h(s; t) satisfies (2.2), then $h(s; t) = h^n(s; t/n)$ and
therefore h(s; t) is infinitely divisible for each t. The assertion of the
preceding section is contained in the following theorem (which is a
special case of an important general theorem of P. Lévy concerning
arbitrary probability distributions).


Theorem. If $\{h_i\}$ is infinitely divisible, then its generating function
can be written in the form (2.1) (say with t = 1).
[Note that $h^t(s) = h(s; t)$ satisfies (2.2).]
Proof. Suppose that $h^{1/n}(s)$ is a probability generating function for
each n. This is possible only if $h(0) = h_0 > 0$. Then h(s) must be
positive in some interval $|s| \le a < 1$ and in it we have
$0 \le 1 - h(s) < 1$. It follows that
$\log h(s) = \log(1 - \{1 - h(s)\})$ has the Taylor series

(3.1) $\log h(s) = \sum_{i=0}^{\infty} x_i s^i, \qquad -a < s < a.$

Putting s = 0, we see that $x_0 \le 0$. We want to prove that all other
$x_i$ are non-negative. Assume the contrary, and let $r \ge 1$ be the
smallest index such that $x_r < 0$. To avoid clumsy formulas set

(3.2) $A(s) = \sum_{\nu=1}^{r-1} x_\nu s^\nu, \qquad B(s) = \sum_{\nu=r+1}^{\infty} x_\nu s^\nu, \qquad \varepsilon = \frac{1}{n}$

so that

(3.3) $h^{1/n}(s) = e^{\varepsilon x_0} \cdot e^{\varepsilon A(s)} \cdot e^{\varepsilon x_r s^r} \cdot e^{\varepsilon B(s)}.$

By assumption $h^{1/n}(s) = \sum \varphi_\nu s^\nu$ where
$\varphi_\nu \ge 0$. Consider in particular the coefficient $\varphi_r$ of
$s^r$. The power series B(s) contains only powers of order greater than r
and hence does not contribute to $\varphi_r$. Therefore $\varphi_r$ is the
coefficient of $s^r$ in

(3.4) $e^{\varepsilon x_0}\left(1 + \varepsilon A(s) + \tfrac{1}{2}\varepsilon^2 A^2(s) + \ldots\right)(1 + \varepsilon x_r s^r).$

Since A(s) is a polynomial of degree $\le r - 1$ we see that

(3.5) $\varphi_r = e^{\varepsilon x_0}\,[\varepsilon x_r + \varepsilon^2 p(\varepsilon)]$

where $p(\varepsilon)$ is a polynomial in $\varepsilon$. If $x_r < 0$ as
assumed, the right side of (3.5) will be negative for $\varepsilon$
sufficiently small, and thus $\varphi_r < 0$ which is impossible. This
proves that $x_r \ge 0$ for r = 1, 2, .... Moreover, h(1) = 1 and hence
$\log h(1) = \sum x_\nu = 0$, that is, $-x_0 = x_1 + x_2 + \ldots$. To
write h(s) in the form (2.1) with t = 1 it suffices now to put
$-x_0 = \lambda$ and $f_i = x_i/\lambda$.


4. EXAMPLES FOR BRANCHING PROCESSES


We shall describe a chance process which serves as a simplified 
model of many empirical processes and also illustrates the usefulness 
of generating functions. In words the process may be described as 
follows. 

We consider particles which are able to produce new particles of like
kind. A single particle forms the original, or zero, generation. Every
particle has probability $p_k$ (k = 0, 1, 2, ...) of creating exactly k new
particles; the direct descendants of the nth generation form the (n+1)st
generation. The particles of each generation act independently of each
other. We are interested in the size of the successive generations.

A few illustrations may precede a rigorous formulation in terms of 
random variables. 

(a) Nuclear chain reactions. This application became familiar in
connection with the atomic bomb.³ The particles are neutrons, which
are subject to chance hits by other particles. Let p be the probability
that the particle sooner or later scores a hit, thus creating m particles;
then q = 1 - p is the probability that the particle has no descendants;
that is, it remains inactive (is removed or absorbed in a different way).
In this scheme the only possible numbers of descendants are 0 and m,
and the corresponding probabilities are q and p (i.e., $p_0 = q$,
$p_m = p$, $p_j = 0$ for all other j). At worst, the first particle remains
inactive and the process never starts. At best, there will be m particles
of the first generation, $m^2$ of the second, and so on. If p is near one,
the number of particles is likely to increase very rapidly. Mathematically,
this number may increase indefinitely. Physically speaking, for very
large numbers of particles the probabilities of fission cannot remain
constant, and also stochastic independence no longer holds. However,
for ordinary chain reactions, the mathematical description "indefinitely
increasing number of particles" may be translated by "explosion."

(b) Survival of family names. Here (as often in life), only male
descendants count; they play the role of particles, and $p_k$ is the
probability for a newborn boy to become the progenitor of exactly k boys.
Our scheme introduces two artificial simplifications. Fertility is
subject to secular trends, and therefore the distribution $\{p_k\}$ in
reality changes from generation to generation. Moreover, common
inheritance and common environment are bound to produce similarities
among brothers which is contrary to our assumption of stochastic
independence. Our model can be refined to take care of these objections,
but the essential features remain unaffected. We shall derive the
probability of finding k carriers of the family name in the nth generation
and, in particular, the probability of an extinction of the line. Survival
of family names appears to have been the first chain reaction studied
by probability methods. The problem was first treated by F. Galton
(1889); for a detailed account the reader is referred to A. Lotka's book.⁴


3 The following description follows E. Schroedinger, Probability problems in
nuclear chemistry, Proceedings of the Royal Irish Academy, vol. 51, sect. A, No. 1
(December 1945). There the assumption of spatial homogeneity is removed.

4 Théorie analytique des associations biologiques, vol. 2, Actualités scientifiques
et industrielles, No. 780 (1939), pp. 128-186, Hermann et Cie, Paris.

Lotka shows that American experience is reasonably well described by
the distribution $p_0 = 0.4825$, $p_k = (0.2126)(0.5893)^{k-1}$ $(k \ge 1)$,
which, except for the first term, is a geometric distribution.

(c) Genes and mutations. Every gene of a given organism (cf. chapter V,
section 5) has a chance to reappear in 1, 2, 3, ... direct descendants,
and our scheme describes the process, neglecting, of course, variations
within the population and with time. This scheme is of particular use in
the study of mutations,⁵ or changes of form in a gene. A spontaneous
mutation produces a single gene of the new kind, which plays the role
of a zero-generation particle. The theory leads to estimates of the
chances of survival and of the spread of the mutant gene. To fix ideas,
consider (following R. A. Fisher) a corn plant which is father to some
100 seeds and mother to an equal number. If the population size remains
constant, an average of two among these 200 seeds will develop to a
plant. Each seed has probability 1/2 to receive a particular gene. The
probability of a mutant gene's being represented in exactly k new plants
is therefore comparable to the probability of exactly k successes in 200
Bernoulli trials with probability p = 1/200, and it appears reasonable to
assume that $\{p_k\}$ is, approximately, a Poisson distribution with mean
1. If the gene carries a biological advantage, we get a Poisson
distribution with mean $\lambda > 1$.

(d) Waiting lines.⁶ The theory of branching processes is useful for
the analysis of fluctuations in waiting lines (in post offices, telephones,
etc.). A customer arriving at an empty counter and having no waiting
time is termed ancestor; the customers arriving during the ancestor's
service time and joining in the queue are his direct descendants. The
process continues as long as the queue lasts. In this example we are
interested in the total progeny up to the moment of expiration.


5. EXTINCTION PROBABILITIES IN BRANCHING 
PROCESSES 


For a mathematical description of the process let $X_n$ represent the
size of the nth generation. By assumption $X_0 = 1$, and $X_1$ has the
given probability distribution $\{p_k\}$ and generating function
$P(s) = \sum p_k s^k$. The second generation consists of the direct
descendants of the $X_1$ members of the first generation; in other words,
we consider $X_2$ as the sum of $X_1$ mutually independent variables each
having the generating function P(s). By the theorem of section 1 the
generating function of


5 R. A. Fisher, The genetical theory of natural selection, Oxford, 1930, pp. 73ff.

6 D. G. Kendall, Stochastic processes and population growth, Journal Royal
Statistical Society, vol. 11 (1949), pp. 250-268.

$X_2$ is therefore $P_2(s) = P(P(s))$. In like manner $X_3$ is the sum of
$X_1$ variables each having the same distribution as $X_2$, and so the
generating function of $X_3$ is $P_3(s) = P(P_2(s))$. By induction we see
that in general the generating function $P_{n+1}(s)$ of the number
$X_{n+1}$ of particles in the (n+1)st generation is defined recursively by

(5.1) $P_1(s) = P(s), \qquad P_{n+1}(s) = P(P_n(s)).$

In example (4.a) $P(s) = q + ps^m$, and hence
$P_2(s) = q + p(q + ps^m)^m$,
$P_3(s) = q + p\{q + p(q + ps^m)^m\}^m$, etc. For a Poisson distribution
$P(s) = e^{-\lambda + \lambda s}$,
$P_2(s) = e^{-\lambda + \lambda e^{-\lambda + \lambda s}}$, etc. These
formulas are not very pleasing but enable us to draw important
conclusions.

We seek the probability $x_n$ that the process terminates at or before
the nth generation, that is $x_n = P\{X_n = 0\} = P_n(0)$. No extinction
is possible when $p_0 = 0$ and we shall therefore assume that
$0 < p_0 < 1$. It is clear from its definition that $x_n$ increases with n.
This can be seen analytically as follows. In the interval $0 \le s \le 1$
the function P(s) is increasing and we have $x_1 = P(0) = p_0$. Therefore
$x_2 = P(x_1) > P(0) = x_1$ and by induction
$x_{n+1} = P(x_n) > P(x_{n-1}) = x_n$. It follows that the sequence $x_n$
increases monotonically to a number $\zeta$, and obviously $\zeta$
satisfies the equation

(5.2) $\zeta = P(\zeta).$

If u > 0 is an arbitrary root of the equation u = P(u), then
$x_1 = P(0) < P(u) = u$ and so by induction
$x_{n+1} = P(x_n) < P(u) = u$, which shows that $\zeta \le u$.
Accordingly, $x_n$ tends to the smallest positive root of (5.2).

The graph of y = P(s) being convex, the curve and the bisector y = s
can intersect in at most two points. They do intersect at the point
(1, 1) and therefore the equation (5.2) can have at most one root
$0 < \zeta < 1$. When such a root exists, the difference ratio
$\{1 - P(\zeta)\}/\{1 - \zeta\}$ equals one, and by the mean value theorem
there exists a point x lying between $\zeta$ and 1 such that the
derivative $P'(x) = 1$. Since $P'(s)$ is increasing, it follows that a root
$\zeta < 1$ of (5.2) can exist only if $P'(1) > 1$. On the other hand, if
$P'(1) \le 1$ then $\{1 - P(s)\}/\{1 - s\} < 1$ for all s < 1, and this
implies P(s) > s; the graph of P(s) lies above the bisector and hence
(5.2) can have no root. Conversely, if $P'(1) > 1$, then P(s) < s for s
slightly less than 1, whereas $P(0) = p_0 > 0$, so that the curve must
cross the bisector at some point between 0 and 1. This shows that a
positive root $\zeta < 1$ of (5.2) exists if, and only if, $P'(1) > 1$,
and that this root is unique. Now $P'(1) = \sum k p_k$ is the expected
number of direct descendants of each particle, and we can formulate the
basic result:


Let $\mu = \sum k p_k$ be the expected number of direct descendants of a
single particle. If $\mu \le 1$, then the probability tends to one that the
process will terminate before the nth generation (that is, $X_n = 0$). If
$\mu > 1$, then there exists a unique root $\zeta < 1$ of (5.2), and
$\zeta$ is the limit of the probability that the process terminates after
finitely many generations.
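
The iteration $x_{n+1} = P(x_n)$ starting from 0, used in the argument above, also gives a simple way to compute $\zeta$ numerically. The following sketch is illustrative: it applies the iteration to Lotka's distribution quoted in section 4 and to two Poisson offspring distributions whose means 1.5 and 0.8 are assumed values. The function names and the stopping tolerance are hypothetical choices.

```python
import math

def extinction_probability(P, iterations=10_000, tol=1e-14):
    """Iterate x_{n+1} = P(x_n) from x = 0; the limit is the smallest root of s = P(s)."""
    x = 0.0
    for _ in range(iterations):
        nxt = P(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x

def lotka(s):
    # p_0 = 0.4825, p_k = 0.2126 * 0.5893**(k-1) for k >= 1, summed in closed form
    return 0.4825 + 0.2126 * s / (1.0 - 0.5893 * s)

def poisson(lam):
    # Generating function of a Poisson offspring distribution with mean lam
    return lambda s: math.exp(-lam * (1.0 - s))

print(extinction_probability(lotka))          # mean above 1: root below 1
print(extinction_probability(poisson(1.5)))   # mean 1.5 > 1: root below 1
print(extinction_probability(poisson(0.8)))   # mean 0.8 <= 1: tends to 1
```

For Lotka's distribution the mean is about 1.26, so a root below 1 exists; the iteration settles near 0.82. When the mean does not exceed 1 the iterates approach 1, as the theorem asserts.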


The difference $1 - \zeta$ can be called the probability of an infinitely
prolonged process. Usually $x_n$ converges to $\zeta$ rapidly, so that a
terminating process is likely to proceed for only very few generations.
In practice, therefore, $\zeta$ is the probability of a rapid extinction.
In example (4.c) we may call $1 - \zeta$ the probability that a mutant
gene establishes itself. If we start with r particles instead of a single
one, the probability that all r descendant lines die out is $\zeta^r$, and
the probability of at least one being successful is $1 - \zeta^r$. Even if
$\zeta$ is relatively large, $1 - \zeta^r$ is near 1 if the initial number
r is large. In the nuclear chain reaction of example (4.a) this is always
the case, and hence we can say: If $\mu > 1$, the probability of an
explosion is near 1, but for $\mu \le 1$ the probability is 1 that the
process stops after a finite number of generations.

We can also find the expected size of the nth generation E(X,) = 
= P’,(1). Since P,(s) = P(Pn—1(s)), we find 


P'n(1) = P’(Pa1())P’n .(1) = P')P’n (1) = μΕ(Σ,-»), 
and generally by induction 
(5.3) E(X,) = p”. 


Hence, if $\mu > 1$, we should expect an exponential growth. This argument
can be amplified. It is easily seen that not only $P_n(0) \to \zeta$ but
also $P_n(s) \to \zeta$ for all $s < 1$. This means that the coefficients of
$s, s^2, s^3, \ldots$ tend to zero. After a large number of generations the
probability that no descendants exist is near $\zeta$, and the probability that
the number of descendants exceeds any preassigned bound is near $1 - \zeta$;
it is exceedingly improbable to find a moderate number of descendants.⁷
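
The convergence of $x_n = P_n(0)$ to $\zeta$ can be observed numerically. The following Python sketch iterates $x_n = P(x_{n-1})$, which follows from $P_n(s) = P(P_{n-1}(s))$. The offspring distribution `p` and all function names are illustrative assumptions, not taken from the text.

```python
def pgf(p, s):
    """Evaluate P(s) = sum_k p[k] * s**k for offspring probabilities p."""
    return sum(pk * s**k for k, pk in enumerate(p))

def extinction_probabilities(p, generations):
    """Return [x_1, ..., x_n] with x_n = P(x_{n-1}) and x_0 = 0."""
    xs, x = [], 0.0
    for _ in range(generations):
        x = pgf(p, x)
        xs.append(x)
    return xs

# Assumed offspring law: 0, 1, or 2 descendants with these probabilities.
p = [0.25, 0.25, 0.5]
mu = sum(k * pk for k, pk in enumerate(p))   # expected number of descendants
xs = extinction_probabilities(p, 200)
zeta = xs[-1]                                # approximates the root of P(s) = s below 1
print(mu, zeta)
```

For this choice $P(s) = s$ reduces to $2s^2 - 3s + 1 = 0$, whose root below 1 is $\zeta = 1/2$, so the iteration can be checked by hand.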


6. PROBLEMS FOR SOLUTION 


1. The distribution (1.1) has mean $E(N)E(X)$ and variance
$E(N)\,\mathrm{Var}(X) + \mathrm{Var}(N)\,E^2(X)$. Verify this (a) using the generating
function, (b) directly from the definition and the notion of conditional
expectations.

2. Animal trapping [example (1.b)]. If $\{g_n\}$ is a geometric distribution, so
is the resulting distribution. If $\{g_n\}$ is a logarithmic distribution [cf. formula
(2.4)], there results a logarithmic distribution with an added term.


7 For the behavior of $X_n$ see T. E. Harris, Branching processes, Annals of
Mathematical Statistics, vol. 19 (1948), pp. 474-494.

3. In $N$ Bernoulli trials, where $N$ is a random variable with a Poisson
distribution, the numbers of successes and failures are stochastically independent
variables. Generalize this to the multinomial distribution (a) directly, (b)
using multivariate generating functions. [Cf. example IX(1.d).]

4. Randomization. Let $N$ have a Poisson distribution with mean $\lambda$, and
let $N$ balls be placed randomly into $n$ cells. Show without calculation that
the probability of finding exactly $m$ cells empty is

$\binom{n}{m} e^{-\lambda m/n} (1 - e^{-\lambda/n})^{n-m}.$


5. Continuation.⁸ Show that when a fixed number $r$ of balls is placed
randomly into $n$ cells the probability of finding exactly $m$ cells empty equals the
coefficient of $e^{-\lambda}\lambda^r/r!$ in the expression above. (a) Discuss the connection
with moment generating functions (problem XI, 24). (b) Use the result for an
effortless derivation of formula II(11.7).


6. Mixtures of probability distributions. Let $\{f_j\}$ and $\{g_j\}$ be two
probability distributions, $\alpha > 0$, $\beta > 0$, $\alpha + \beta = 1$. Then
$\{\alpha f_j + \beta g_j\}$ is again a probability distribution. Discuss its meaning
and the connection with the urn models of chapter V, section 2. Generalize to more
than two distributions. Can such a mixture be a compound Poisson distribution?

7. In the branching process prove that
$\mathrm{Var}(X_{n+1}) = \mu\,\mathrm{Var}(X_n) + \mu^{2n}\sigma^2$,
using (a) generating functions, (b) conditional expectations. Conclude that
$\mathrm{Var}(X_n) = \sigma^2(\mu^{2n-2} + \mu^{2n-3} + \cdots + \mu^{n-1})$.

8. Continuation. If $n > m$ show that $E(X_m X_n) = \mu^{n-m} E(X_m^2)$.

9. Continuation. Show that the bivariate generating function of $X_m$, $X_n$ is
$P_m(s_1 P_{n-m}(s_2))$. Use this to verify the assertion in 8.


8 This elegant derivation of various combinatorial formulas by randomizing a
parameter is due to C. Domb, On the use of a random parameter in combinatorial
problems, Proceedings Royal Philosophical Society, Sec. A., vol. 65 (1952), pp.
305-309.


CHAPTER XIII 


Recurrent Events. 
The Renewal Equation 


1. INFORMAL PREPARATIONS AND EXAMPLES 


We shall be concerned with certain repetitive patterns connected
with repeated trials. Roughly speaking, a pattern $\mathcal{E}$ qualifies for the
following theory if after each occurrence of $\mathcal{E}$ the trials start from
scratch in the sense that the trials following an occurrence of $\mathcal{E}$ form
a replica of the whole experiment. The waiting times between successive
occurrences of $\mathcal{E}$ are mutually independent random variables having
the same distribution.

The simplest special case arises when $\mathcal{E}$ stands as abbreviation for
“a success occurs” in a sequence of Bernoulli trials. The waiting time
up to the first success has a geometric distribution; when the first success
occurs, the trials start anew, and the number of trials between the
$r$th and the $(r+1)$st success has the same geometric distribution. The
waiting time up to the $r$th success is the sum of $r$ independent variables
[example IX(8.c)]. By contrast, suppose that people are sampled one
by one and let $\mathcal{E}$ stand for “Two people in the sample have birthdays
the same day of the year.” Here $\mathcal{E}$ is not repetitive; once it has occurred
it persists. The sampling may proceed until a second double birthday
turns up, but this second phase is not a replica of the first one. The
larger a sample, the greater the probability of a duplication of birthdays;
therefore a long waiting time for the first double birthday promises
a short interval between the first and the second duplication. The
two consecutive waiting times not only have different distributions but
are stochastically dependent. Such waiting times are not the object
of the theory of recurrent events.

A phenomenon of a different type occurs when we are interested in
the appearance of two consecutive successes in Bernoulli trials. The
first occurrence of the pattern SS is well defined, but if $\mathcal{E}$ stands for “a
run of exactly two successes,” the third trial may undo the second; if
four successive trials produce the sequence SSSF, then $\mathcal{E}$ occurs at the
second trial, but the whole sequence contains no $\mathcal{E}$. For us it is important
that the event “$\mathcal{E}$ occurs at the $n$th trial” depends solely on
the outcome of the first $n$ trials and not on the future.

A few typical problems to which the theory of recurrent events does 
apply are listed in the following 


Examples. (a) Success runs in Bernoulli trials. The term “success
run of length $r$” has been defined in several ways. It is largely a matter
of convention and convenience whether a sequence of three consecutive
successes is said to contain 0, 1, or 2 runs of length 2, and for different
purposes different definitions have been adopted. However, if we are
to use the theory of recurrent events, then the notion of runs of length
$r$ must be defined so that we start from scratch every time a run is
completed. This means adopting the following definition. A sequence
of $n$ letters S and F contains as many runs of length $r$ as there are
non-overlapping uninterrupted successions of exactly $r$ letters S. In a sequence
of Bernoulli trials a run of length $r$ occurs at the $n$th trial, if the $n$th trial
adds a new run to the sequence. Thus in SSS|SF|SSS|SSS we have
three runs of length 3, and they occur at trials number 3, 8, 11; there
are five runs of length 2, and they occur at trials number 2, 4, 7, 9, 11.
This definition has the advantage of a considerable simplification of
the theory since runs of a fixed length become recurrent events. (This
topic will be taken up in sections 7 and 8.)
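
The counting convention for runs can be made concrete with a short Python sketch. It scans a string of letters S and F and restarts the count from scratch whenever a run of length r is completed; the function name `run_completions` is an illustrative choice, not part of the text.

```python
def run_completions(seq, r):
    """Trial numbers (1-based) at which a success run of length r is completed.

    After each completed run the count restarts from zero, so that runs
    are non-overlapping, as required by the definition above.
    """
    hits, streak = [], 0
    for n, letter in enumerate(seq, start=1):
        streak = streak + 1 if letter == "S" else 0
        if streak == r:
            hits.append(n)
            streak = 0          # start from scratch
    return hits

seq = "SSSSFSSSSSS"             # the sequence SSS|SF|SSS|SSS of the example
print(run_completions(seq, 3))  # [3, 8, 11]
print(run_completions(seq, 2))  # [2, 4, 7, 9, 11]
```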

(b) A counter problem. Counters of the type used for cosmic rays
and $\alpha$-particles may be described by the following simplified model.
Bernoulli trials are performed at a uniform rate. A counter is designed
to register successes, but the mechanism is locked for exactly $r - 1$
trials following each registration. In other words, a success at the $n$th
trial is registered if, and only if, no registration has occurred in the
preceding $r - 1$ trials. The counter is then locked at the conclusion of
trials number $n, \ldots, n + r - 1$, and is freed at the conclusion of the
$(n + r)$th trial provided this trial results in failure. The output of the
counter represents dependent trials; each registration has an aftereffect.
However, whenever the counter is free (not locked) the situation
is exactly the same, and the trials start from scratch. Letting $\mathcal{E}$
stand for “at the conclusion of the trial the counter is free,” we have
a typical recurrent pattern (cf. problems 9 and 10 and XV, 13).


* We are describing a discrete analogue of the so-called counters of type I. Type
II is described in problem 10. For a description see H. Maier-Leibnitz, Die
Koinzidenzmethode und ihre Anwendung auf kernphysikalische Probleme, Physikalische
Zeitschrift, vol. 43 (1942), pp. 333-362.
(c) Return to the origin. In a sequence of Bernoulli trials with
probability $p$ of success let $\mathcal{E}$ stand as an abbreviation for “The cumulative
numbers of successes and failures are equal.” As we have done before,
we describe Bernoulli trials in terms of independent variables $\{X_j\}$
with the common distribution $P\{X_j = 1\} = p$ and $P\{X_j = -1\} = q$,
and put

(1.1)    $S_n = X_1 + X_2 + \cdots + X_n.$

Then $S_n$ is the accumulated excess of heads over tails, and our $\mathcal{E}$ occurs
at the $n$th trial if, and only if, $S_n = 0$. We shall describe $\mathcal{E}$ as the return
to 0. Given that $S_n = 0$, the subsequent partial sums


(1.2)    $S'_0 = S_n, \quad S'_1 = S_{n+1}, \quad S'_2 = S_{n+2}, \quad \ldots$


are subject to exactly the same probability relations as the original
sequence $\{S_j\}$, and a return to 0 for $\{S'_j\}$ means a return to 0 for
$\{S_j\}$ and vice versa.

The event “$\mathcal{E}$ occurs for the first time at the $n$th trial” alias “The
first return to the origin takes place at the $n$th trial” is defined as the
aggregate of sequences $\{X_j\}$ such that


(1.3)    $S_1 \neq 0, \quad S_2 \neq 0, \quad \ldots, \quad S_{n-1} \neq 0, \quad S_n = 0.$


If this occurs we say that the waiting time $T$ equals $n$, and for the
probability of (1.3) we write $f_n = P\{T = n\}$. The first few terms are
easily found by direct enumeration of all admissible sequences; clearly
$f_n = 0$ whenever $n$ is odd and $f_2 = 2pq$, $f_4 = 2p^2q^2$, $f_6 = 4p^3q^3$,
$f_8 = 10p^4q^4$, $f_{10} = 28p^5q^5$. The same sequence $\{f_n\}$ represents the
probability distribution of the waiting time between the $r$th and the
$(r+1)$st occurrence of $\mathcal{E}$, and we call $\{f_n\}$ also the distribution of
recurrence times. (The distribution $\{f_n\}$ has been found in chapter XI,
section 3, by the use of generating functions. In chapter III the special
case $p = q = \tfrac{1}{2}$ is treated, and the formulas apply in general since the
number of outcomes satisfying (1.3) is independent of $p$. In the present
chapter we give a new and independent derivation.)
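
The direct enumeration mentioned above is easy to carry out by machine. The Python sketch below counts, for each n, the sequences of n steps ±1 whose partial sums first vanish at step n; since every such sequence of length 2k contains k successes and k failures, the count is the numerical coefficient of $p^k q^k$ in $f_{2k}$. The function name and the brute-force approach are illustrative assumptions.

```python
from itertools import product

def first_return_count(n):
    """Number of +1/-1 sequences of length n whose partial sums
    satisfy S_1 != 0, ..., S_{n-1} != 0, S_n = 0."""
    count = 0
    for steps in product((1, -1), repeat=n):
        s, admissible = 0, True
        for i, x in enumerate(steps, start=1):
            s += x
            if s == 0 and i < n:      # an earlier return disqualifies the sequence
                admissible = False
                break
        if admissible and s == 0:
            count += 1
    return count

counts = {n: first_return_count(n) for n in range(1, 11)}
print(counts)   # even n give 2, 2, 4, 10, 28; odd n give 0
```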

(d) Ladder points in Bernoulli trials. Adhering to the same notations
we define a new repetitive pattern $\mathcal{E}$ by “$\mathcal{E}$ occurs at the $n$th trial if $S_n$
exceeds all preceding sums,” that is, if


(1.4)    $S_n > 0, \quad S_n > S_1, \quad S_n > S_2, \quad \ldots, \quad S_n > S_{n-1}.$


In this case we shall say that the $n$th trial (or the index $n$) represents
a ladder point. In the sequence of partial sums $S_1, S_2, \ldots$ given by
$-1, 0, 1 \mid 2 \mid 1, 2, 3 \mid 2, 1, 2, 1, 2, 3, 4 \mid 5 \mid$ (see figure 3 of chapter III)
ladder points occur at the trials number 3, 4, 7, 14, 15, and the waiting
times between consecutive occurrences are 3, 1, 3, 7, 1. The $r$th
occurrence of $\mathcal{E}$ can be described as the first occurrence of the value $r$,
and therefore the ladder points may be described as moments of first
passages.
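
As a check on the example, the following Python sketch locates the ladder points of a given sequence of partial sums according to (1.4). The function name `ladder_points` is an illustrative choice.

```python
def ladder_points(partial_sums):
    """1-based indices n with S_n > 0 and S_n > S_k for every k < n."""
    points, best = [], 0        # best = max(0, S_1, ..., S_{n-1})
    for n, s in enumerate(partial_sums, start=1):
        if s > best:
            points.append(n)
            best = s
    return points

sums = [-1, 0, 1, 2, 1, 2, 3, 2, 1, 2, 1, 2, 3, 4, 5]
pts = ladder_points(sums)
waits = [b - a for a, b in zip([0] + pts, pts)]
print(pts)     # [3, 4, 7, 14, 15]
print(waits)   # [3, 1, 3, 7, 1]
```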

If $\mathcal{E}$ occurs at the $n$th trial the process starts from scratch in the
following sense. Assuming (1.4) to hold, a later trial number $n + m$ is
a ladder point if, and only if,


(1.5)    $S_{n+m} > S_n, \quad S_{n+m} > S_{n+1}, \quad S_{n+m} > S_{n+2}, \quad \ldots, \quad S_{n+m} > S_{n+m-1}.$
Put
(1.6)    $S_k^* = S_{n+k} - S_n = X_{n+1} + \cdots + X_{n+k}.$


Then $n + m$ is a ladder point for the sequence $S_1, S_2, \ldots$ if, and only
if, $m$ is a ladder point for $\{S_k^*\}$. Clearly the operation defined in (1.6)
produces an independent replica of the original sample space, and $\mathcal{E}$
qualifies for the theory of recurrent events. Note that in this case the
sequence (1.2) as such is probabilistically different from the original
sequence: after the $r$th occurrence of $\mathcal{E}$ the partial sums $S_j$ are bound
to be close to $r$ and not to 0. Nevertheless, as far as our pattern $\mathcal{E}$ is
concerned, the trials following the occurrence of $\mathcal{E}$ start from scratch.

(The ladder points provide a means of reducing the study of first- 
passage times to recurrent events, that is, to the summation of inde- 
pendent random variables. A direct (equivalent) approach is given 
in chapter [X, section 8. The notion of ladder points can be used prof- 
itably for sequences of arbitrary random variables, for example in 
connection with the general arc sine law.) 

(e) In a sequence of consecutive throws of a perfect die let $\mathcal{E}$ stand
for “Ones, twos, ..., sixes appeared in equal numbers.” Here the
recurrent character of $\mathcal{E}$ requires no further comment.


2. DEFINITIONS 


We consider a sequence of repeated trials with possible outcomes
$E_j$ ($j = 1, 2, \ldots$). They need not be independent (applications to
Markov chains being of special interest). As usual, we suppose that
it is in principle possible to continue the trials indefinitely, the
probabilities $P\{E_{j_1}, E_{j_2}, \ldots, E_{j_n}\}$ being defined consistently for all finite
sequences. Let $\mathcal{E}$ be an attribute of finite sequences; that is, we suppose
that it is uniquely determined whether a sequence $(E_{j_1}, \ldots, E_{j_n})$
has, or has not, the characteristic $\mathcal{E}$. We agree that the expression “$\mathcal{E}$
occurs at the $n$th place in the (finite or infinite) sequence $E_{j_1}, E_{j_2}, \ldots$”
is an abbreviation for “The subsequence $E_{j_1}, E_{j_2}, \ldots, E_{j_n}$ has the
attribute $\mathcal{E}$.” This convention implies that the occurrence of $\mathcal{E}$ at the
$n$th trial depends solely on the outcome of the first $n$ trials. It is also
understood that when speaking of a “recurrent event $\mathcal{E}$,” we are really
referring to a class of events defined by the property that $\mathcal{E}$ occurs.
Clearly $\mathcal{E}$ itself is a label rather than an event. We are here abusing
the language in the same way as is generally accepted in terms such
as “a two-dimensional problem”; the problem itself is dimensionless.


Definition 1. The attribute $\mathcal{E}$ defines a recurrent event if:

(a) In order that $\mathcal{E}$ occurs at the $n$th and the $(n+m)$th place of the
sequence $(E_{j_1}, E_{j_2}, \ldots, E_{j_{n+m}})$ it is necessary and sufficient that $\mathcal{E}$ occurs
at the last place in each of the two subsequences $(E_{j_1}, E_{j_2}, \ldots, E_{j_n})$ and
$(E_{j_{n+1}}, E_{j_{n+2}}, \ldots, E_{j_{n+m}})$.

(b) Whenever this is the case we have


$P\{E_{j_1}, \ldots, E_{j_{n+m}}\} = P\{E_{j_1}, \ldots, E_{j_n}\}\, P\{E_{j_{n+1}}, \ldots, E_{j_{n+m}}\}.$

It has now an obvious meaning to say that $\mathcal{E}$ occurs in the sequence
$(E_{j_1}, E_{j_2}, \ldots)$ for the first time at the $n$th place, etc. It is also clear
that with each recurrent event $\mathcal{E}$ there are associated the two sequences
of numbers defined for $n = 1, 2, \ldots$ as follows


(2.1)    $u_n = P\{\mathcal{E} \text{ occurs at the } n\text{th trial}\},$
         $f_n = P\{\mathcal{E} \text{ occurs for the first time at the } n\text{th trial}\}.$


It will be convenient to define
(2.2)    $f_0 = 0, \quad u_0 = 1,$


and to introduce the generating functions
(2.3)    $F(s) = \sum_{k=1}^{\infty} f_k s^k, \quad U(s) = \sum_{k=0}^{\infty} u_k s^k.$


Observe that $\{u_j\}$ is not a probability distribution; in fact, in
representative cases we shall have $\sum u_j = \infty$. However, the events “$\mathcal{E}$
occurs for the first time at the $n$th trial” are mutually exclusive, and
therefore

(2.4)    $f = \sum_{n=1}^{\infty} f_n \le 1.$
It is clear that $1 - f$ should be interpreted as the probability that $\mathcal{E}$ does
not occur in an indefinitely prolonged sequence of trials. If $f = 1$ we
may introduce a random variable $T$ with distribution


(2.5)    $P\{T = n\} = f_n.$


We shall use the same notation (2.5) even if $f < 1$. Then $T$ is an
improper, or defective, random variable, which with probability $1 - f$ does
not assume a numerical value. (For our purposes we could assign to $T$
the symbol $\infty$, and it should be clear that no new rules are required.)

The waiting time for $\mathcal{E}$, that is, the number of trials up to and
including the first occurrence of $\mathcal{E}$, is a random variable with the
distribution (2.5); however, this random variable is really defined only
in the space of infinite sequences $(E_{j_1}, E_{j_2}, \ldots)$.

By the definition of recurrent events the probability that $\mathcal{E}$ occurs
for the first time at trial number $k$ and for the second time at the $n$th
trial equals $f_k f_{n-k}$. Therefore the probability $f_n^{(2)}$ that $\mathcal{E}$ occurs for
the second time at the $n$th trial equals


(2.6)    $f_n^{(2)} = f_1 f_{n-1} + f_2 f_{n-2} + \cdots + f_{n-1} f_1.$


The right side is the convolution of $\{f_n\}$ with itself and therefore
$\{f_n^{(2)}\}$ represents the probability distribution of the sum of two
independent random variables each having the distribution (2.5). More
generally, if $f_n^{(r)}$ is the probability that the $r$th occurrence of $\mathcal{E}$ takes
place at the $n$th trial we have


(2.7)    $f_n^{(r)} = f_1 f_{n-1}^{(r-1)} + f_2 f_{n-2}^{(r-1)} + \cdots + f_{n-1} f_1^{(r-1)}.$
This simple fact is expressed in the 


Theorem. Let $f_n^{(r)}$ be the probability that the $r$th occurrence of $\mathcal{E}$ takes
place at the $n$th trial. Then $\{f_n^{(r)}\}$ is the probability distribution of the sum


(2.8)    $T^{(r)} = T_1 + T_2 + \cdots + T_r$


of $r$ independent random variables $T_1, \ldots, T_r$ each having the distribution
(2.5). In other words: For fixed $r$ the sequence $\{f_n^{(r)}\}$ has the generating
function $F^r(s)$.
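
The recursion (2.7) is a repeated convolution and can be evaluated directly. The Python sketch below does so for truncated sequences and compares the result with a case where the answer is known in closed form: for geometric recurrence times (successes in Bernoulli trials) the waiting time for the rth success has the negative binomial distribution. The parameter values and function names are illustrative assumptions.

```python
from math import comb

def convolve(a, b):
    """Convolution of two equally long sequences, truncated to their length."""
    return [sum(a[k] * b[n - k] for k in range(n + 1)) for n in range(len(a))]

def rth_occurrence(f, r):
    """f_n^{(r)} for n = 0..len(f)-1, obtained by applying (2.7) r-1 times.
    f[n] is f_n, with f[0] = 0."""
    out = list(f)
    for _ in range(r - 1):
        out = convolve(f, out)
    return out

# Assumed example: geometric recurrence times f_n = q**(n-1) * p.
p, q, n_max = 0.3, 0.7, 12
f = [0.0] + [q ** (n - 1) * p for n in range(1, n_max + 1)]
f3 = rth_occurrence(f, 3)

# Closed form for the third success: C(n-1, 2) p^3 q^(n-3).
closed = [comb(n - 1, 2) * p**3 * q ** (n - 3) if n >= 3 else 0.0
          for n in range(n_max + 1)]
print(max(abs(a - b) for a, b in zip(f3, closed)))
```

Truncation is harmless here because the nth term of a convolution involves only terms of index at most n.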
It follows in particular that
(2.9)    $\sum_{n=1}^{\infty} f_n^{(r)} = F^r(1) = f^r$:


the probability that $\mathcal{E}$ occurs at least $r$ times equals $f^r$ (a fact which
could have been anticipated). We now introduce


Definition 2. A recurrent event $\mathcal{E}$ will be called persistent² if $f = 1$
and transient if $f < 1$.


2 In the first edition the terms certain and uncertain were used, but the present 
terminology is preferable in applications to Markov chains. 
For a transient $\mathcal{E}$ the probability that it occurs more than $r$ times
tends to zero, whereas for a persistent $\mathcal{E}$ this probability remains unity.
This can be described by saying with probability one: A persistent $\mathcal{E}$ is
bound to occur infinitely often whereas a transient $\mathcal{E}$ occurs only a finite
number of times. (This statement not only is a description but is formally
correct if interpreted in the sample space of infinite sequences
$E_{j_1}, E_{j_2}, \ldots$.)

We require one more definition. In Bernoulli trials a return to the
origin [example (1.c)] can occur only at an even-numbered trial. In
this case $f_{2n+1} = u_{2n+1} = 0$, and the generating functions $F(s)$ and
$U(s)$ are power series in $s^2$ rather than $s$. Similarly, in example (1.e)
$\mathcal{E}$ can occur only at trials number 6, 12, 18, .... We express this by
saying that $\mathcal{E}$ is periodic. Such recurrent events have a great nuisance
value; in each instance the situation is quite obvious, but all general
theorems require mention of the nominally special case of periodicity.


Definition 3. The recurrent event $\mathcal{E}$ is called periodic if there exists
an integer $\lambda > 1$ such that $\mathcal{E}$ can occur only at trials number $\lambda, 2\lambda, 3\lambda, \ldots$
(i.e., $u_n = 0$ whenever $n$ is not divisible by $\lambda$). The greatest $\lambda$ with this
property is called the period of $\mathcal{E}$.


In conclusion let us remark that in the sample space of infinite
sequences $E_{j_1}, E_{j_2}, \ldots$ the number of trials between the $(r-1)$st and the
$r$th occurrence of $\mathcal{E}$ is a well-defined random variable (possibly a
defective one), having the probability distribution of our $T_r$. In other
words, our variables $T_r$ really stand for the waiting times between the
successive occurrences of $\mathcal{E}$ (the recurrence times). We have defined the
$T_r$ analytically in order not to refer to sample spaces beyond the scope
of this volume, but it is hoped that the probabilistic background appears
in all its intuitive simplicity. The notion of recurrent events is
designed to reduce a fairly general situation to sums of independent
random variables. Conversely, an arbitrary probability distribution
$\{f_n\}$, $n = 1, 2, \ldots$ may be used to define a recurrent event. We prove
this assertion by the


Example. Self-renewing aggregates. Consider an electric bulb, fuse,
or other piece of equipment with a finite life span. As soon as the
piece fails, it is replaced by a new piece of like kind, which in due time
is replaced by a third piece, and so on. We assume that the life span
is a random variable which ranges only over multiples of a unit time
interval (year, day, or second). Each time unit then represents a trial
with possible outcomes “replacement” and “no replacement.” The
successive replacements may be treated as recurrent events. If $f_n$ is
the probability that a new piece will serve for exactly $n$ time units,
then $\{f_n\}$ is the distribution of the recurrence times. When it is certain
that the life span is finite, then $\sum f_n = 1$ and the recurrent event
is persistent. Usually it is known that the life span cannot exceed a
fixed number $m$, in which case the generating function $F(s)$ is a
polynomial of a degree not exceeding $m$. In applications we desire the
probability $u_n$ that a replacement takes place at time $n$. This $u_n$ may
be calculated from equation (3.1). Here we have a class of recurrent
events defined solely in terms of an arbitrary distribution $\{f_n\}$. The
case $f < 1$ is not excluded, $1 - f$ being the probability of an eternal
life of our piece of equipment.


3. THE BASIC RELATIONS 


We adhere to the notations (2.2)-(2.4) and propose to investigate
the connection between the $\{f_n\}$ and the $\{u_n\}$. The probability that
$\mathcal{E}$ occurs for the first time at trial number $\nu$ and then again at a later
trial $n > \nu$ is, by definition, $f_\nu u_{n-\nu}$. The probability that $\mathcal{E}$ occurs at
the $n$th trial for the first time is $f_n = f_n u_0$. Since these cases are
mutually exclusive we have


(3.1)    $u_n = f_1 u_{n-1} + f_2 u_{n-2} + \cdots + f_n u_0, \quad n \ge 1.$


At the right we recognize the convolution $\{f_k\} * \{u_k\}$ with the generating
function $F(s)\,U(s)$. At the left we find the sequence $\{u_n\}$ with the
term $u_0$ missing, so that its generating function is $U(s) - 1$. Thus
$U(s) - 1 = F(s)\,U(s)$, and we have proved


Theorem 1. The generating functions of $\{u_n\}$ and $\{f_n\}$ are related by

(3.2)    $U(s) = \dfrac{1}{1 - F(s)}.$


Note. The right side in (3.2) can be expanded into a geometric series
$\sum F^r(s)$ converging for $|s| < 1$. The coefficient $f_n^{(r)}$ of $s^n$ in $F^r(s)$ being
the probability that the $r$th occurrence of $\mathcal{E}$ takes place at the $n$th
trial, equation (3.2) is equivalent to


(3.3)    $u_n = f_n^{(1)} + f_n^{(2)} + \cdots + f_n^{(n)}$


and expresses the obvious fact that if $\mathcal{E}$ occurs at the $n$th trial, it has
previously occurred 0, 1, 2, ..., $n-1$ times. (Clearly $f_n^{(r)} = 0$ for
$r > n$.)
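
Equation (3.1) determines the $u_n$ recursively from the $f_n$. A minimal Python sketch of this recursion follows; it is tested on Bernoulli trials, where the recurrence times are geometric and every $u_n$ with $n \ge 1$ must equal $p$. The function name `renewal_u` and the numerical values are illustrative assumptions.

```python
def renewal_u(f, n_max):
    """u_0..u_{n_max} from (3.1).

    f[k] is f_k with f[0] = 0; terms f_k beyond len(f)-1 are taken as 0.
    """
    u = [1.0]                                   # u_0 = 1 by (2.2)
    for n in range(1, n_max + 1):
        top = min(n, len(f) - 1)
        u.append(sum(f[k] * u[n - k] for k in range(1, top + 1)))
    return u

p, q = 0.3, 0.7
f = [0.0] + [q ** (k - 1) * p for k in range(1, 41)]   # geometric recurrence times
u = renewal_u(f, 40)
print(u[1], u[20], u[40])    # each should be close to p = 0.3
```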


Theorem 2. For $\mathcal{E}$ to be transient, it is necessary and sufficient that


(3.4)    $u = \sum_{j=0}^{\infty} u_j$


is finite. In this case the probability $f$ that $\mathcal{E}$ ever occurs is given by


(3.5)    $f = \dfrac{u - 1}{u}.$


Note. We can interpret $u_j$ as the expectation of a random variable
which equals 1 or 0 according to whether $\mathcal{E}$ does or does not occur at
the $j$th trial. Hence $u_1 + u_2 + \cdots + u_n$ is the expected number of
occurrences of $\mathcal{E}$ in $n$ trials, and $u - 1$ can be interpreted as the
expected number of occurrences of $\mathcal{E}$ in infinitely many trials.


Proof. The coefficients $u_j$ being non-negative, it is clear that $U(s)$
increases monotonically as $s \to 1$ and that for each $N$


$\sum_{n=0}^{N} u_n \le \lim_{s \to 1} U(s) \le \sum_{n=0}^{\infty} u_n = u.$

Since $U(s) \to (1 - f)^{-1}$ when $f < 1$ and $U(s) \to \infty$ when $f = 1$, the
theorem follows.


The next theorem is of particular importance. The proof is of an
elementary nature, but since it does not contribute to a probabilistic
understanding we defer it to the end of the chapter. (See, however,
problem 1.)


Theorem 3. Let $\mathcal{E}$ be persistent and not periodic and denote by $\mu$ the
mean of the recurrence times $T_\nu$, that is,


(3.6)    $\mu = \sum j f_j = F'(1)$
(possibly $\mu = \infty$). Then
(3.7)    $u_n \to \mu^{-1}$


as $n \to \infty$ ($u_n \to 0$ if the mean recurrence time is infinite).


3 P. Erdős, W. Feller, and H. Pollard, A theorem on power series, Bulletin of the
American Mathematical Society, vol. 55 (1949), pp. 201-204. This theorem was
conjectured and proved for the purpose of obtaining a better access to ergodic
properties of infinite Markov chains established by Kolmogorov. After the
appearance of the first edition it was observed by K. L. Chung that theorem 3 is
really equivalent to the ergodic theorem of Kolmogorov and could be deduced from
it. Previously a great many papers were devoted to various special cases and
variants. Later, theorem 3 was generalized to continuous random variables and made
more precise in various ways by Blackwell, Chung, Erdős, and Wolfowitz. Blackwell
gave an elegant simple proof that (3.7) holds for all integral-valued random
variables (not necessarily positive ones as in the text), provided they have a
positive mean. His method is based on the use of ladder points for arbitrary variables
[cf. example (1.d)]. See D. Blackwell, Extension of a renewal theorem, Pacific
Journal of Mathematics, vol. 3 (1953), pp. 315-320.
Theorem 4. If $\mathcal{E}$ is persistent and has period $\lambda > 1$, then as $n \to \infty$
(3.8)    $u_{n\lambda} \to \lambda \mu^{-1}$


and $u_k = 0$ for every $k$ not divisible by $\lambda$.


Proof. Since $\mathcal{E}$ has period $\lambda$, the series $F(s) = \sum f_n s^n$ contains only
powers of $s^\lambda$, and so $F(s^{1/\lambda}) = F_1(s)$ where $F_1(s)$ is again a power series
with positive coefficients, and $F_1(1) = 1$. Theorem 3 implies that the
coefficients of $U_1(s) = \{1 - F_1(s)\}^{-1}$ tend to $\mu_1^{-1}$ where


$\mu_1 = F_1'(1) = \lambda^{-1} F'(1) = \mu/\lambda.$


(Clearly $\mu$ and $\mu_1$ are either both finite or both infinite.) Now
$U(s) = U_1(s^\lambda)$ and so (3.8) holds.
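
Theorems 3 and 4 can be illustrated numerically by solving (3.1) for two small recurrence-time distributions. The Python sketch below uses an aperiodic distribution with mean 3/2 and a distribution of period 2 with mean 3; in both cases the limit predicted by the theorems is 2/3. The distributions and the function name are illustrative assumptions.

```python
def renewal_u(f, n_max):
    """u_0..u_{n_max} from (3.1); f[k] is f_k, f[0] = 0, f_k = 0 past the list."""
    u = [1.0]
    for n in range(1, n_max + 1):
        top = min(n, len(f) - 1)
        u.append(sum(f[k] * u[n - k] for k in range(1, top + 1)))
    return u

# Aperiodic case: f_1 = f_2 = 1/2, mean mu = 3/2, so u_n should approach 2/3.
u_a = renewal_u([0.0, 0.5, 0.5], 60)

# Periodic case: f_2 = f_4 = 1/2, period 2, mean mu = 3,
# so u_{2n} should approach 2/3 while odd-numbered terms vanish.
u_b = renewal_u([0.0, 0.0, 0.5, 0.0, 0.5], 121)

print(u_a[60], u_b[120], u_b[121])
```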


Examples. (a) For a trite example let $\mathcal{E}$ stand for “success” in
Bernoulli trials. Then $u_n = p$, by the very definition. Theorem 3
states that the expected number of trials between two consecutive successes
is $p^{-1}$. Here $U(s) = 1 + ps(1 - s)^{-1} = (1 - qs)(1 - s)^{-1}$,
and from theorem 1 we conclude that $F(s) = ps(1 - qs)^{-1}$, showing
that the waiting time between consecutive successes has a geometric
distribution.

(b) Return to the origin in Bernoulli trials [example (1.c)]. If at the
$k$th trial the cumulative numbers of successes and failures are equal,
then $k$ must be an even number, $k = 2n$, and $n$ trials must have resulted
in success, the other $n$ in failure. Therefore we have for the probability
of an equalization


(3.9)    $u_{2n} = \binom{2n}{n} p^n q^n.$


We know from the normal approximation to the binomial distribution,
and we can also readily verify using Stirling's formula, that


(3.10)    $\binom{2n}{n} 2^{-2n} \sim \dfrac{1}{(\pi n)^{1/2}},$

so that

(3.11)    $u_{2n} \sim \dfrac{(4pq)^n}{(\pi n)^{1/2}},$


the sign $\sim$ indicating that the ratio of the two sides tends to unity.

If $p \neq \tfrac{1}{2}$, then $4pq < 1$, and $\sum u_{2n}$ converges faster than the geometric
series with ratio $4pq$. If $p = \tfrac{1}{2}$, then $u_{2n} \sim (\pi n)^{-1/2}$; hence $\sum u_{2n}$
diverges, but $u_{2n} \to 0$. Our theorems permit the conclusion that with
probability one the following is true:
If $p \neq q$, then the cumulative sums $S_n$ will vanish only finitely many
times. If $p = q = \tfrac{1}{2}$, they will pass through 0 infinitely often, but the
mean recurrence time is infinite.

In the case $p \neq q$ the assertion is obvious intuitively and follows
also from the strong law of large numbers. In gambling language, if
the game is favorable for Peter, he can rest assured that after a few
initial fluctuations his net gain will be positive and remain so. When
$p = q = \tfrac{1}{2}$, the situation is much less intuitive and is the source of the
paradoxical features of the fluctuations in coin tossing described in
chapter III, section 7.

The theorems above can supply additional information. Using the
readily verified formula


(3.12)    $\binom{2n}{n} = (-4)^n \binom{-\frac{1}{2}}{n}$


and the binomial expansion II(8.7), we get from (3.9)
(3.13)    $U(s) = \sum_{n=0}^{\infty} u_{2n} s^{2n} = (1 - 4pqs^2)^{-1/2}.$


If $p \neq \tfrac{1}{2}$, then $u = U(1) = (1 - 4pq)^{-1/2} = |p - q|^{-1}$. From (3.5)
we conclude that the probability $f$ that the accumulated numbers of successes
and failures will ever equalize is given by


(3.14)    $f = 1 - |p - q|.$


(This is the probability of at least one return to the origin.)
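
The value of f can be confirmed numerically by summing the series $u = \sum u_{2n}$ with $u_{2n} = \binom{2n}{n}(pq)^n$ and inserting the result in $f = (u-1)/u$. The Python sketch below does this; the function name, the truncation point `terms`, and the test values of p are illustrative assumptions.

```python
def return_probability(p, terms=5000):
    """Approximate f = (u - 1)/u, where u = sum over n of C(2n, n) (pq)^n.

    Meaningful only for p != 1/2, where the series converges.
    """
    q = 1.0 - p
    term, u = 1.0, 1.0                      # u_0 = 1
    for n in range(terms):
        # C(2n+2, n+1) / C(2n, n) = 2(2n+1)/(n+1)
        term *= 2.0 * (2 * n + 1) / (n + 1) * p * q
        u += term
    return (u - 1.0) / u

for p in (0.6, 0.9):
    print(p, return_probability(p), 1 - abs(p - (1 - p)))
```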
From (3.2) we get for the generating function of the recurrence times


(3.15)    $F(s) = 1 - (1 - 4pqs^2)^{1/2}.$
This formula is most interesting in the case $p = q = \tfrac{1}{2}$. Then
(3.16)    $F(s) = 1 - (1 - s^2)^{1/2}$


and the binomial expansion shows that


(3.17)    $f_{2n} = (-1)^{n+1} \binom{\frac{1}{2}}{n} = \dfrac{1}{n} \binom{2n-2}{n-1} 2^{-2n+1}$


($f_n$ vanishes whenever $n$ is odd). Equation (3.17) gives the distribution
of the recurrence times for the return to the origin in the classical
coin-tossing game.

(We have obtained this formula by different methods in chapter III, 
section 4, and chapter XI, section 3. The present method, although 
not the most elementary, is the most straightforward.) 
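
As a consistency check for the symmetric case, the closed forms $u_{2n} = \binom{2n}{n} 2^{-2n}$ and $f_{2n} = \frac{1}{n}\binom{2n-2}{n-1} 2^{-2n+1}$ should satisfy the basic relation (3.1). The Python sketch below verifies this in exact rational arithmetic; the helper names `u2` and `f2` are illustrative.

```python
from fractions import Fraction
from math import comb

def u2(n):
    """u_{2n} = C(2n, n) 2^{-2n} for p = q = 1/2."""
    return Fraction(comb(2 * n, n), 4**n)

def f2(n):
    """f_{2n} = (1/n) C(2n-2, n-1) 2^{-2n+1} for n >= 1."""
    return Fraction(2 * comb(2 * n - 2, n - 1), n * 4**n)

# (3.1) restricted to even indices: u_{2n} = sum_{k=1}^{n} f_{2k} u_{2n-2k}.
ok = all(u2(n) == sum(f2(k) * u2(n - k) for k in range(1, n + 1))
         for n in range(1, 31))
print(ok, [f2(n) for n in range(1, 6)])
```

The first values 1/2, 1/8, 1/16, 5/128, 7/256 agree with $2pq$, $2p^2q^2$, $4p^3q^3$, $10p^4q^4$, $28p^5q^5$ at $p = q = 1/2$.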
(c) Ties in multiple coin games. We consider repeated independent
tossings of two coins and say that $\mathcal{E}$ has occurred whenever the
accumulated number of heads (and therefore of tails) is the same for both
coins. Clearly


(3.18)    $u_n = 2^{-2n}\left[\binom{n}{0}^2 + \binom{n}{1}^2 + \cdots + \binom{n}{n}^2\right].$


Using II(12.11) and (3.10), we find that


(3.19)    $u_n = \binom{2n}{n} 2^{-2n} \sim \dfrac{1}{(\pi n)^{1/2}}.$


Hence $\sum u_n$ diverges, but $u_n \to 0$. Therefore $\mathcal{E}$ is persistent but has
infinite mean recurrence time.

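More generally, consider the simultaneous tossing of $r$ coins, and let
$\mathcal{E}$ stand for the recurrent event that all $r$ coins are in the same phase
(accumulated numbers of heads are the same for all coins). Then


(3.20)    $u_n = 2^{-rn}\left[\binom{n}{0}^r + \binom{n}{1}^r + \cdots + \binom{n}{n}^r\right].$


To estimate $u_n$ note that the maximal term of the binomial distribution
$\binom{n}{k} 2^{-n}$ is smaller than $n^{-1/2}$. Therefore


(3.21)    $u_n < n^{-(r-1)/2} \cdot 2^{-n}\left[\binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{n}\right] = n^{-(r-1)/2}.$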


Accordingly $\sum u_n$ converges if $r \ge 4$. For $r = 2$ we saw that $\sum u_n$
diverges. A special consideration is necessary for the case $r = 3$.
From the normal approximation to the binomial distribution we know
that for sufficiently large $n$ and values of $k$ lying between $\tfrac{1}{2}n - n^{1/2}$ and
$\tfrac{1}{2}n + n^{1/2}$ we have $\binom{n}{k} 2^{-n} > c\,n^{-1/2}$, where $c$ is a positive constant (say
e~*). Therefore, when r = 3, 
(3.22)    $u_n > 2n^{1/2}\,(c\,n^{-1/2})^3 = 2c^3/n,$


and hence $\sum u_n$ diverges. In other words, the recurrent event $\mathcal{E}$ that $k$
coins show the same cumulative numbers of heads is persistent if, and only
if, $k \le 3$. The mean recurrence time is infinite in each case.
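
The quantities in this example are easy to tabulate. The Python sketch below evaluates $u_n = 2^{-rn}\sum_k \binom{n}{k}^r$ for r coins, checks the identity $\sum_k \binom{n}{k}^2 = \binom{2n}{n}$ used for two coins, and checks the bound $u_n < n^{-(r-1)/2}$ for r = 4 over a finite range. The function name and the ranges tested are illustrative assumptions; a finite check of this kind is of course no substitute for the argument in the text.

```python
from math import comb

def u(n, r):
    """Probability that r coins show equal numbers of heads after n tosses each."""
    return sum(comb(n, k) ** r for k in range(n + 1)) / 2 ** (r * n)

# Two coins: sum of squared binomial coefficients equals C(2n, n).
two_coin_ok = all(sum(comb(n, k) ** 2 for k in range(n + 1)) == comb(2 * n, n)
                  for n in range(1, 40))

# Four coins: u_n stays below n^{-3/2}, a convergent comparison series.
bound_ok = all(u(n, 4) < n ** -1.5 for n in range(1, 60))

print(two_coin_ok, bound_ok, u(10, 2), u(10, 3), u(10, 4))
```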

(d) Dice. In example (1.e) we considered the recurrent event $\mathcal{E}$ that
the accumulated numbers of aces, twos, threes, etc., are equal. Obviously
$\mathcal{E}$ has period 6 and $u_{6n} = (6n)!\,(n!)^{-6}\,6^{-6n}$. Using Stirling's
formula, we readily find that $u_{6n}$ is of the order of magnitude $n^{-5/2}$, so that
$\sum u_n$ converges. Hence $\mathcal{E}$ is transient. From (3.5) it is easy to
calculate that the probability of a recurrence is about 0.022.
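
The numerical value can be reproduced by summing $u_{6n} = (6n)!\,(n!)^{-6}\,6^{-6n}$ and applying $f = (u-1)/u$. The Python sketch below works with logarithms of factorials to avoid overflow; the truncation at 20000 terms is an assumption justified by the $n^{-5/2}$ decay of the terms.

```python
from math import exp, lgamma, log

def u6(n):
    """u_{6n} = (6n)! / (n!)^6 / 6^(6n), computed via log-gamma."""
    return exp(lgamma(6 * n + 1) - 6 * lgamma(n + 1) - 6 * n * log(6))

u = 1.0 + sum(u6(n) for n in range(1, 20001))   # u_0 = 1 plus the series
f = (u - 1.0) / u
print(u6(1), f)
```

The first term is $720/6^6 \approx 0.0154$, which already accounts for most of the sum.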

(e) For applications to the theory of runs see sections 7 and 8. 


4. THE RENEWAL EQUATION 


The basic equation (3.1) of the theory of recurrent events is a special 
case of the so-called renewal equation (4.1), which is encountered in 
many different connections. We proceed to show that the theorems 
of the last section apply without essential modification to this more 
general equation. The discussion will be of a purely analytic character, 
probabilistic interpretations and applications being reserved for the 
next section. 

Let $\{a_n\}$ and $\{b_n\}$ be two sequences such that $0 \le a_n < 1$ and $b_n \ge 0$
(where $n = 0, 1, 2, \ldots$). A third sequence $\{u_n\}$ is defined by the
recursive relations


(4.1)    $u_n = b_n + (a_0 u_n + a_1 u_{n-1} + \cdots + a_n u_0)$
or
(4.2)    $\{u_n\} = \{b_n\} + \{a_n\} * \{u_n\}.$


Solving (4.1) successively, we get
$u_0 = b_0/(1 - a_0), \quad u_1 = (b_1 + a_1 u_0)/(1 - a_0), \quad \ldots,$


so that no problem about the existence of a unique solution $\{u_n\}$ arises.
We are interested in the behavior of $\{u_n\}$ as $n \to \infty$, a problem to
which a great number of papers (mostly of controversial nature) have
been devoted.
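
The successive solution of (4.1) can be written as a short routine. The Python sketch below solves for $u_n$ by moving the term $a_0 u_n$ to the left side, exactly as in the formulas for $u_0$ and $u_1$ above. The function name and the sample sequences are illustrative assumptions; with $\sum a_n = 1$, mean $\sum n a_n = 3/2$ and $\sum b_n = 3$ the values should settle near $3/(3/2) = 2$.

```python
def solve_renewal(a, b, n_max):
    """u_0..u_{n_max} from u_n = b_n + sum_{k=0}^{n} a_k u_{n-k}.

    a and b must have at least n_max + 1 entries, and a[0] < 1.
    """
    u = []
    for n in range(n_max + 1):
        tail = sum(a[k] * u[n - k] for k in range(1, n + 1))
        u.append((b[n] + tail) / (1.0 - a[0]))
    return u

n_max = 60
a = [0.0, 0.5, 0.5] + [0.0] * (n_max - 2)    # assumed example, sum a_n = 1
b = [1.0, 2.0] + [0.0] * (n_max - 1)         # assumed example, sum b_n = 3
u = solve_renewal(a, b, n_max)
print(u[:4], u[n_max])
```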

Setting $b_n = 0$, $a_n = f_n$, for $n = 1, 2, \ldots$ and $b_0 = 1$, $a_0 = 0$ reduces
equation (4.1) to (3.1). Formally, therefore, the renewal equation (4.1)
is more general, but we shall derive its properties from those of (3.1).
Once more we introduce the generating functions


(4.3)    $A(s) = \sum a_n s^n, \quad B(s) = \sum b_n s^n, \quad U(s) = \sum u_n s^n.$


The coefficients $a_n$ and $b_n$ being bounded, the first two series converge
at least for $|s| < 1$; the convergence of the last series will presently
become evident. Equation (4.1) can now be rewritten in the form
$U(s) = B(s) + A(s)\,U(s)$ or


(4.4)    $U(s) = \dfrac{B(s)}{1 - A(s)}.$

For $B(s) = 1$ this reduces to (3.2) with the essential difference that
now $\{a_n\}$ is not necessarily the distribution of a recurrence time, so
that $A(s)$ can be larger as well as smaller than 1.

We shall say that we have the periodic case if there exists an integer
$\lambda > 1$, such that all $a_k$ except, perhaps, $a_\lambda, a_{2\lambda}, a_{3\lambda}, \ldots$ vanish. Then
$A(s)$ is a power series in $s^\lambda$. The largest integer $\lambda$ with the said property
is called the period.


Theorem 1. Suppose that $\{a_n\}$ is not periodic and that
$B(1) = \sum b_n$ is finite.

(a) If $\sum a_n = 1$, then

(4.5)   $u_n \to B(1)\mu^{-1}$ where $\mu = \sum n a_n$.

(In particular, $u_n \to 0$ if $\sum n a_n$ diverges.)

(b) If $\sum a_n < 1$, then the series

(4.6)   $\sum u_n = B(1)\{1 - A(1)\}^{-1}$

converges.

(c) If $\sum a_n > 1$ and also if the series diverges, then there exists
a unique positive root $x < 1$ of the equation $A(x) = 1$. In this case


(4.7)   $u_n \sim \dfrac{B(x)}{x^{n+1} A'(x)},$

the sign $\sim$ indicating that the ratio of the two sides tends to unity.
(Relation (4.7) implies that $u_n$ increases geometrically; the derivative
$A'(x)$ is finite since $A(s)$ is regular for $|s| < 1$.)


Proof. (a) If $v_n$ is the coefficient of $s^n$ in $\{1 - A(s)\}^{-1}$, then
$v_n \to \mu^{-1}$ by theorem 3 of the last section. Now

(4.8)   $u_n = v_n b_0 + v_{n-1} b_1 + \dots + v_0 b_n$.

For every fixed $k$ the term $v_{n-k} b_k$ tends to $b_k/\mu$ as $n \to \infty$.
Moreover, the $v_n$ are bounded. It follows that, for $N$ sufficiently
large, $u_n$ differs arbitrarily little from

(4.9)   $u'_n = v_n b_0 + v_{n-1} b_1 + \dots + v_{n-N} b_N$,

and $u'_n \to (b_0 + \dots + b_N)/\mu$ which in turn differs arbitrarily
little from $B(1)/\mu$.

(b) Here the proof of theorem 2, section 3, applies without
modification.

(c) Here it suffices to apply the result under (a) to the sequences
$\{a_n x^n\}$, $\{b_n x^n\}$, and $\{u_n x^n\}$ which have the generating
functions $A(xs)$, $B(xs)$, and $U(xs)$, and which are obviously related
in the same way as the original sequences.

Unfortunately completeness requires a special mention of periodic
sequences $\{a_n\}$ where $A(s) = \sum a_{n\lambda} s^{n\lambda}$ is a power
series in $s^\lambda$. In this case we divide the coefficients $u_n$ into
groups of equal phase, $\{u_0, u_\lambda, u_{2\lambda}, u_{3\lambda}, \dots\}$,
$\{u_1, u_{\lambda+1}, u_{2\lambda+1}, u_{3\lambda+1}, \dots\}$, $\dots$,
$\{u_{\lambda-1}, u_{2\lambda-1}, u_{3\lambda-1}, \dots\}$. It is obvious
from (4.4) that the coefficients $u_{n\lambda}$ depend only on $b_0$,
$b_\lambda$, $b_{2\lambda}$, $\dots$ but not on the $b_k$ with $k$ not
divisible by $\lambda$. This leads us to represent $U(s)$ and $B(s)$ as
the sum of $\lambda$ power series in $s^\lambda$

(4.10)  $U(s) = U_0(s) + sU_1(s) + \dots + s^{\lambda-1}U_{\lambda-1}(s)$,
        $B(s) = B_0(s) + sB_1(s) + \dots + s^{\lambda-1}B_{\lambda-1}(s)$,

where

(4.11)  $U_j(s) = \sum_{n=0}^{\infty} u_{n\lambda+j}\, s^{n\lambda}, \quad B_j(s) = \sum_{n=0}^{\infty} b_{n\lambda+j}\, s^{n\lambda}$.

Then, from (4.4) for $j = 0, 1, \dots, \lambda-1$,

(4.12)  $U_j(s) = \dfrac{B_j(s)}{1 - A(s)}$.

Here all functions are power series in $s^\lambda$, and the preceding
theorem applies after the change of variables $s^\lambda = t$. This
leads to

Theorem 2. In the periodic case with period $\lambda$ the sequence
$\{u_n\}$ is asymptotically periodic; if $A(1) = 1$, each of the $\lambda$
subsequences $\{u_{n\lambda+j}\}$ has a limit

(4.13)  $\lim_{n \to \infty} u_{n\lambda+j} = \dfrac{\lambda B_j(1)}{\mu}$

where $B_j(1) = b_j + b_{\lambda+j} + b_{2\lambda+j} + b_{3\lambda+j} + \dots$.


Example. Repeated averaging. Given three positive numbers $u_1$,
$u_2$, $u_3$, define an infinite sequence $\{u_n\}$ by taking running
arithmetic means

(4.14)  $u_4 = \tfrac{1}{3}(u_1 + u_2 + u_3), \quad u_5 = \tfrac{1}{3}(u_2 + u_3 + u_4), \quad \dots,$
        $u_{n+3} = \tfrac{1}{3}(u_n + u_{n+1} + u_{n+2})$.

We seek information concerning the asymptotic behavior of $\{u_n\}$.
More precisely, we propose to show that

(4.15)  $u_n \to \tfrac{1}{6}(u_1 + 2u_2 + 3u_3)$.

Needless to say, the same argument will apply to arbitrary means (cf.
problems 5 and XV, 15). The point is that problems of this type are
reducible to the renewal equation (4.1) and throw a new light on its
nature.

If we put

(4.16)  $a_0 = 0, \quad a_1 = a_2 = a_3 = \tfrac{1}{3}, \quad a_n = b_n = 0$ for $n \ge 4$,

then (4.14) and (4.1) agree for $n \ge 4$. To reduce (4.14) to (4.1) for
all $n$ we have to define $b_0 = u_0 = 0$ and determine $b_1$, $b_2$, $b_3$ from

(4.17)  $b_1 = u_1, \quad b_2 = u_2 - \tfrac{1}{3}u_1, \quad b_3 = u_3 - \tfrac{1}{3}(u_1 + u_2)$.

Now we can apply theorem 1(a) to obtain (4.15) without further
calculations: here $\mu = \tfrac{1}{3}(1 + 2 + 3) = 2$ and
$B(1) = b_1 + b_2 + b_3 = \tfrac{1}{3}(u_1 + 2u_2 + 3u_3)$. Since the
generating function $U(s)$ is rational, we can expand it into partial
fractions to see that the limit in (4.15) is approached with
exponential rapidity and to estimate the difference of the two sides.
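
The limit (4.15) is easy to check numerically. The following sketch iterates the running means (4.14) directly and compares the result with $B(1)/\mu$ computed from (4.16) and (4.17); the starting values and the helper name `running_means` are arbitrary choices made for illustration.

```python
def running_means(u1, u2, u3, n_terms):
    """Sequence u_1, u_2, ... in which every term after the third is the
    arithmetic mean of the three preceding terms, as in (4.14)."""
    seq = [u1, u2, u3]
    while len(seq) < n_terms:
        seq.append(sum(seq[-3:]) / 3.0)
    return seq


# Arbitrary positive starting values (an assumption for the demonstration).
u1, u2, u3 = 1.0, 5.0, 2.0
seq = running_means(u1, u2, u3, 60)

# Limit predicted by (4.15).
limit = (u1 + 2 * u2 + 3 * u3) / 6.0

# The same limit obtained as B(1)/mu from (4.16) and (4.17).
b1, b2, b3 = u1, u2 - u1 / 3.0, u3 - (u1 + u2) / 3.0
mu = (1 + 2 + 3) / 3.0
renewal_limit = (b1 + b2 + b3) / mu

print(seq[-1], limit, renewal_limit)
```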
5. DELAYED RECURRENT EVENTS 


We shall now introduce a slight extension of the notion of recurrent 
events which is so obvious that it could pass without special mention, 
except that it is convenient to have a term for it and to have the basic 
equations on record. 

Perhaps the best informal description of delayed recurrent events is
to say that they refer to trials where we have 'missed the beginning
and start in the middle.' The waiting time up to the first occurrence
of $\mathcal{E}$ has a distribution $\{b_n\}$ different from the
distribution $\{f_n\}$ of the recurrence times between the following
occurrences of $\mathcal{E}$. The theory applies without change except
that the trials following each occurrence of $\mathcal{E}$ are exact
replicas of a fixed sample space which is not identical with the
original one.

The situation being so simple, we shall forego formalities and agree
to speak of a delayed recurrent event $\mathcal{E}$ when the definition of
recurrent events applies only if the trials leading up to the first
occurrence of $\mathcal{E}$ are disregarded; it is understood that the
waiting time up to the first appearance of $\mathcal{E}$ is a random
variable independent of the following recurrence times, although its
distribution $\{b_n\}$ may be different from the common distribution
$\{f_n\}$ of the recurrence times.

It is easy to calculate the probabilities $u_n$ of the occurrence of
$\mathcal{E}$ at the nth trial directly from the definition above and the
results of section 3. However, it is preferable to proceed
independently and to write down a new equation of the renewal type.

The probability that $\mathcal{E}$ occurs at trial number $n - k$ and the
next time at the nth trial equals $u_{n-k} f_k$. These events are
mutually exclusive, and their union for $k = 1, 2, \dots, n-1$ is the
event that $\mathcal{E}$ occurs at the nth trial and at some previous
trial. The probability that $\mathcal{E}$ occurs at the nth trial for the
first time equals $b_n$, and hence for $n \ge 1$

(5.1)   $u_n = b_n + u_{n-1} f_1 + u_{n-2} f_2 + \dots + u_1 f_{n-1}$.

For the delayed events it is most natural to set

(5.2)   $u_0 = f_0 = b_0 = 0$;

this reduces (5.1) to the renewal equation

(5.3)   $\{u_n\} = \{b_n\} + \{u_n\} * \{f_n\}$,

and the corresponding generating functions satisfy

(5.4)   $U(s) = \dfrac{B(s)}{1 - F(s)}$.

The results of the last section now contain as a special case the

Theorem. If $\mathcal{E}$ is not periodic and if $\sum f_n = 1$ (that is,
$\mathcal{E}$ is persistent), then

(5.5)   $u_n \to \mu^{-1} \sum b_n, \quad \mu = \sum n f_n$.

If $f = \sum f_n < 1$ (that is, $\mathcal{E}$ is transient), then

(5.6)   $\sum u_n = (1 - f)^{-1} \sum b_n$.

In the periodic case theorem 2 of section 4 applies.


Examples. (a) In the counter problem (1.0) suppose that at time 0
the counter was locked for exactly two time units (in other words, the
observations begin two trials after a registration). The counter is
locked for at least $r - 2$ additional units and becomes free at trial
number $r - 1$ if that trial results in failure; otherwise it registers
and therefore remains locked for at least $r$ additional trials, etc.
It follows that $b_{r-2} = q$, $b_{2r-2} = pq$, $b_{3r-2} = p^2 q$, $\dots$.

(b) Self-renewing aggregates. In the example of section 2 we have
considered a piece of equipment whose lifetime is a random variable
with distribution $\{f_n\}$. When it expires, it is immediately replaced
by a new piece, and the process continues in this way, $\mathcal{E}$
standing for "replacement at time n." In section 2 we assumed that at
time 0 a new piece of equipment is installed. Suppose, instead, that at
time 0 the age of the piece is $k$. Then $\mathcal{E}$ becomes a delayed
recurrent event, and we have to calculate the probability distribution
$\{b_n\}$ of the waiting time for the first replacement. Clearly $b_n$ is
the probability that a piece of equipment will expire at age $n + k$,
given that it has attained age $k$. Thus

(5.7)   $b_n = \dfrac{f_{n+k}}{r_k}$, where $r_k = f_{k+1} + f_{k+2} + f_{k+3} + \dots$.

In applications it is not natural to consider just one piece of
equipment but a whole population. Suppose then that the initial
population (at time 0) consists of $N$ elements, among which exactly
$v_k$ are of age $k$ (where $\sum v_k = N$). Each element originates a line
of descendants, and at any time $n$ there is a certain probability that
a replacement is required in this line. The sum of these probabilities
for all $N$ elements is the expected number $u_n$ of replacements at time
$n$. Obviously $u_n$ satisfies the basic equation (5.3) with

(5.8)   $b_n = \sum_{k=0}^{\infty} v_k \dfrac{f_{n+k}}{r_k}$,

and our theorems show that $u_n$ will converge.

It is easy to calculate not only the limit of $u_n$ but also the age
distribution at time $n$ and its asymptotic behavior. Let $v_k(n)$ be the
expected number of elements of age $k$ at time $n$ (so that
$v_k(0) = v_k$). Clearly

(5.9)   $v_k(n) = u_{n-k}\, r_k$ if $k < n$,
        $v_k(n) = v_{k-n}\, \dfrac{r_k}{r_{k-n}}$ if $k \ge n$.

In the non-periodic case we know that $u_n \to B(1)/\mu = N/\mu$ as
$n \to \infty$, and it follows from (5.9) that $v_k(n) \to N r_k/\mu$. Hence,
in the non-periodic case, there is a stable limiting age distribution:
In the limit the expected number of elements of age $k$ is $N r_k/\mu$,
where $N$ is the (constant) population size, and $\mu = \sum r_k$ the mean
duration of life (if $\mu = \infty$, then the population ages
indefinitely). The basic fact is that the limiting age distribution is
independent of the initial age distribution and depends only on the
mortality distribution $\{f_n\}$ (cf. problems 17 and 18).

As a numerical illustration consider a population of $N = 1000$
elements with the initial age distribution $v_0 = 500$, $v_1 = 320$,
$v_2 = 74$, $v_3 = 100$, $v_4 = 6$. Assume the survival probabilities
$f_1 = 0.20$, $f_2 = 0.43$, $f_3 = 0.17$, $f_4 = 0.17$, $f_5 = 0.03$ (so that 5
is the maximal age). Here $U(s)$ is a rational function,

(5.10)  $U(s) = \dfrac{397s + 332s^2 + 159s^3 + 97s^4 + 15s^5}{1 - 0.20s - 0.43s^2 - 0.17s^3 - 0.17s^4 - 0.03s^5}$

and can be expanded into partial fractions,

$U(s) = \dfrac{1250s}{3(1 - s)} - \dfrac{972s}{61(1 + \tfrac{3}{5}s)} + \dfrac{38s}{87(1 + \tfrac{1}{5}s)} - \dfrac{78{,}225s^2 + 22{,}125s}{5307(1 + s^2/4)}$.

The age distributions $\{v_k(n)\}$ for $n = 1, 2, 3, \dots$ may be calculated
directly from the renewal equation. The columns of table 1 give these
age distributions $\{v_k(n)\}$ together with the limiting distribution and
show that the approach to the limit is not monotonic.

TABLE 1

 k | n = 0 |   1   |   2   |   3   |   4   |   5   |   6   |   7   | $\infty$
 0 |  500  |  397  | 411.4 |  412  | 423.8 | 414.3 | 417.0 | 416.0 | 416.7
 1 |  320  |  400  | 317.6 | 329.1 | 329.6 | 339.0 | 331.5 | 333.6 | 333.3
 2 |   74  |  148  |  185  | 146.9 | 152.2 | 152.4 | 156.8 | 153.3 | 154.2
 3 |  100  |   40  |   80  |  100  |  79.4 |  82.3 |  82.4 |  84.8 |  83.3
 4 |    6  |   15  |    6  |   12  |   15  |  11.9 |  12.3 |  12.4 |  12.5
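
The entries of table 1 can be recomputed from (5.8), (5.1), and (5.9). The sketch below is one way to do this; the function names are hypothetical, and the data are the initial ages and lifetime probabilities quoted in the numerical illustration.

```python
# Lifetime distribution f[n] = P{lifetime = n}; f[0] = 0 and 5 is the maximal age.
f = [0.0, 0.20, 0.43, 0.17, 0.17, 0.03]
# Initial age distribution: v0[k] elements of age k at time 0 (N = 1000).
v0 = [500.0, 320.0, 74.0, 100.0, 6.0]
MAX_AGE = len(f) - 1

# Tails r[k] = f[k+1] + f[k+2] + ... = P{lifetime > k}.
r = [sum(f[k + 1:]) for k in range(MAX_AGE + 1)]


def b(n):
    """Expected number of first replacements at time n, as in (5.8)."""
    return sum(v0[k] * f[n + k] / r[k]
               for k in range(len(v0)) if n + k <= MAX_AGE)


def replacements(n_max):
    """u[n] for n = 0..n_max from the renewal equation (5.1), with u[0] = 0."""
    u = [0.0]
    for n in range(1, n_max + 1):
        u.append(b(n) + sum(u[n - k] * f[k]
                            for k in range(1, min(n, MAX_AGE) + 1)))
    return u


def age_distribution(n, u):
    """Expected number of elements of each age k at time n, as in (5.9)."""
    return [u[n - k] * r[k] if k < n else v0[k - n] * r[k] / r[k - n]
            for k in range(len(v0))]


u = replacements(7)
for n in range(8):
    print(n, [round(x, 1) for x in age_distribution(n, u)])

mu = sum(r)  # mean duration of life
print("limit", [round(sum(v0) * rk / mu, 1) for rk in r[:len(v0)]])
```

Each printed row is one column of table 1 (the age distribution at a fixed time $n$).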

(c) Population theory. This theory is analogous to renewal theory,
except that the population size is variable and female births play the
role of replacements. The essential novelty is that a mother can have
zero, one, or more daughters, so that lines may become extinct or
branch. We now define $a_n$ as the probability that a newborn female
will survive and at age $n$ give birth to a female child (the dependence
on the number and ages of previous children is neglected). Then
$\sum a_n$ is the expected number of daughters, and hence all three cases
$\sum a_n < 1$, $\sum a_n = 1$, $\sum a_n > 1$ can now occur. The preceding
argument applies with this obvious modification.


6. THE NUMBER OF OCCURRENCES OF $\mathcal{E}$


Up to now we have considered the first, second, ..., rth occurrence
of a recurrent event $\mathcal{E}$ and taken the number of trials as a
random variable. Often it is more natural to take the opposite point of
view, namely to fix the number $n$ of trials and to consider the number
$N_n$ of occurrences of $\mathcal{E}$ in $n$ trials as a random variable. We
shall investigate the asymptotic behavior of $N_n$ for large $n$.

As in (2.8) let $T^{(r)}$ stand for the number of trials up to and
including the rth occurrence of $\mathcal{E}$. The probability
distributions of $T^{(r)}$ and $N_n$ are related by the obvious identity

(6.1)   $P\{N_n \ge r\} = P\{T^{(r)} \le n\}$.

We begin with the simple case where $\mathcal{E}$ is persistent and the
distribution $\{f_n\}$ of its recurrence times has finite mean $\mu$ and
variance $\sigma^2$. Since $T^{(r)}$ is the sum of $r$ independent variables,
the central limit theorem (chapter X, section 1) asserts that for each
fixed $x$ as $r \to \infty$

(6.2)   $P\left\{ \dfrac{T^{(r)} - r\mu}{\sigma r^{1/2}} < x \right\} \to \Phi(x)$

where $\Phi(x)$ is the normal distribution function. Now let $n \to \infty$
and $r \to \infty$ in such a way that

(6.3)   $\dfrac{n - r\mu}{\sigma r^{1/2}} \to x$;

then (6.1) and (6.2) together lead to

(6.4)   $P\{N_n \ge r\} \to \Phi(x)$.

To write this relation in a more familiar form we introduce the
reduced variable

(6.5)   $N_n^* = \dfrac{N_n - n\mu^{-1}}{\sigma n^{1/2} \mu^{-3/2}}$.

The inequality $N_n \ge r$ takes on the form

(6.6)   $N_n^* \ge \dfrac{r - n\mu^{-1}}{\sigma n^{1/2} \mu^{-3/2}} = -\dfrac{n - r\mu}{\sigma r^{1/2}} \cdot \left( \dfrac{r\mu}{n} \right)^{1/2}$,

and (6.3) shows that the right side tends to $-x$. Thus

(6.7)   $P\{N_n^* \ge -x\} \to \Phi(x)$ or $P\{N_n^* < -x\} \to 1 - \Phi(x)$,

and we have proved the

Theorem 1. Normal approximation. If the recurrent event $\mathcal{E}$ is
persistent and its recurrence times have finite mean $\mu$ and variance
$\sigma^2$, then both the number $T^{(r)}$ of trials up to the rth occurrence
of $\mathcal{E}$ and the number $N_n$ of occurrences of $\mathcal{E}$ in the
first $n$ trials are asymptotically normally distributed as indicated
in (6.2) and (6.7).


Note that in (6.7) we have the central limit theorem applied to a
sequence of dependent variables $N_n$. The relations (6.7) make it
plausible that

(6.8)   $E(N_n) \sim \dfrac{n}{\mu}, \quad \mathrm{Var}(N_n) \sim \dfrac{n\sigma^2}{\mu^3}$,

but an exact proof requires an additional argument.

The usefulness of theorem 1 will be illustrated by an application to
the theory of runs in the next section. However, it should be borne
in mind that most recurrence times occurring in the fluctuation theory
of random variables and in physical processes have infinite means, and
that theorem 1 must be replaced by more general limit theorems.⁴

We expect intuitively that $E(N_n)$ should always increase linearly
with $n$ [as in (6.8)], simply "because in twice as many trials
$\mathcal{E}$ should occur twice as often." And yet this is not so. The
return to the origin in coin tossing [example (1.c)] is typical of the
recurrence times in diffusion theory and may once more serve as an
example for the unexpected features of fluctuations in general.


Theorem 2. Recurrence paradox. Let $\mathcal{E}$ be the return to the
origin in symmetric Bernoulli trials (coin tossing). The expected
number $E(N_{2n})$ of occurrences of $\mathcal{E}$ in $2n$ trials is given by

(6.9)   $E(N_{2n}) = (2n + 1) \dbinom{2n}{n} 2^{-2n} - 1$

so that

(6.10)  $E(N_{2n}) \sim 2(n/\pi)^{1/2}$

(and $E(N_n)$ is of the order of magnitude $n^{1/2}$ instead of increasing
linearly with $n$).
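
Formula (6.9) can be checked against a direct summation. The sketch below assumes the standard coin-tossing probability $u_{2k} = \binom{2k}{k} 2^{-2k}$ of a return to the origin at trial $2k$ and uses the fact that the expected number of occurrences is the sum of these probabilities; the function names are illustrative only.

```python
from math import comb, pi, sqrt


def expected_returns_by_sum(n):
    """E(N_{2n}) as the sum of the return probabilities u_2, u_4, ..., u_{2n},
    assuming u_{2k} = C(2k, k) / 4^k for symmetric coin tossing."""
    return sum(comb(2 * k, k) / 4.0 ** k for k in range(1, n + 1))


def expected_returns_closed_form(n):
    """Right side of (6.9)."""
    return (2 * n + 1) * comb(2 * n, n) / 4.0 ** n - 1.0


for n in (1, 2, 10, 100, 400):
    direct = expected_returns_by_sum(n)
    closed = expected_returns_closed_form(n)
    asymptotic = 2.0 * sqrt(n / pi)  # right side of (6.10)
    print(n, direct, closed, asymptotic)
```

The first two columns should agree to rounding error, while the third approaches them only in ratio, since (6.10) ignores bounded terms.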


Proof. Recalling formula XI(1.8) we may calculate $E(N_n)$ from the
"tails" in (6.1) to obtain

(6.11)  $E(N_n) = \sum_{r=1}^{\infty} P\{N_n \ge r\} = \sum_{r=1}^{\infty} P\{T^{(r)} \le n\}$.

The generating function of $T^{(r)}$ is $F^r(s)$ where $F(s)$ was found in
(3.16) to be $F(s) = 1 - (1 - s^2)^{1/2}$. By theorem 1 of chapter XI,
section 1, the generating function of the cumulative probabilities
$P\{T^{(r)} \le n\}$ is $F^r(s)(1 - s)^{-1}$, and so by (6.11) the sequence
$\{E(N_n)\}$ has the generating function

(6.12)  $\sum E(N_n) s^n = \dfrac{F(s)}{(1 - s)(1 - F(s))} = \dfrac{1 + s}{(1 - s^2)^{3/2}} - \dfrac{1}{1 - s}$.

It follows that

(6.13)  $E(N_{2n}) = E(N_{2n+1}) = (-1)^n \dbinom{-3/2}{n} - 1$,

and this has been rewritten in the form (6.9) for convenience [using
II(12.5)].

The curious implications of this theorem have been discussed at
length in chapter III, section 6. Theorem 2 of that section shows that
$N_n n^{-1/2}$ has an asymptotic distribution given by the positive part
of the normal distribution. The normalization $N_n n^{-1/2}$ stands in
sharp contrast to that of theorem 1 above.

⁴ W. Feller, Fluctuation theory of recurrent events, Transactions of
the American Mathematical Society, vol. 67 (1949), pp. 98-119.


*7. APPLICATION TO THE THEORY OF SUCCESS RUNS


In the sequel $r$ will denote a fixed positive integer and $\mathcal{E}$
will stand for the occurrence of a success run of length $r$ in a
sequence of Bernoulli trials. It is important that the length of a run
be defined as stated in example (1.a), for otherwise runs are not
recurrent events, and the calculations become more involved. As in
(2.1) and (2.2), $u_n$ is the probability of $\mathcal{E}$ at the nth trial,
and $f_n$ is the probability that the first run of length $r$ occurs at
the nth trial.

The probability that the $r$ trials number $n, n-1, n-2, \dots, n-r+1$
result in success is obviously $p^r$. In this case $\mathcal{E}$ occurs at
one among these $r$ trials; the probability that $\mathcal{E}$ occurs at the
trial number $n - k$ ($k = 0, 1, \dots, r-1$) and the following $k$ trials
result in success is $u_{n-k} p^k$. Since these $r$ possibilities are
mutually exclusive, we get the following recurrence relation:⁵

(7.1)   $u_n + u_{n-1} p + \dots + u_{n-r+1} p^{r-1} = p^r$.

This equation is valid for $n \ge r$. Clearly

(7.2)   $u_1 = u_2 = \dots = u_{r-1} = 0, \quad u_0 = 1$.


Now multiply (7.1) by $s^n$ and sum over $n = r, r+1, r+2, \dots$. In
view of (7.2) we get on the left side

(7.3)   $\{U(s) - 1\}(1 + ps + p^2 s^2 + \dots + p^{r-1} s^{r-1})$

and on the right side $p^r(s^r + s^{r+1} + \dots)$. Summing the two
geometric series, we find

(7.4)   $\{U(s) - 1\} \cdot \dfrac{1 - (ps)^r}{1 - ps} = \dfrac{p^r s^r}{1 - s}$

or

(7.5)   $U(s) = \dfrac{1 - s + q p^r s^{r+1}}{(1 - s)(1 - p^r s^r)}$.

Using equation (3.2), we get for the generating function of the
recurrence times

(7.6)   $F(s) = \dfrac{p^r s^r (1 - ps)}{1 - s + q p^r s^{r+1}} = \dfrac{p^r s^r}{1 - qs(1 + ps + \dots + p^{r-1} s^{r-1})}$.

* Sections 7 and 8 treat a special topic and may be omitted.

⁵ The classical approach consists in deriving a recurrence relation for
$f_n$. This method is more complicated and does not apply to, say, runs
of either kind or patterns like SSFFSS, to which our method applies
without change [cf. example (8.c)].


The fact that $F(1) = 1$ shows that in a prolonged sequence of trials
the number of runs of any length is certain to increase over all
bounds. The mean recurrence time $\mu$ can be obtained directly from (7.1)
since we know that $u_n \to \mu^{-1}$. If we require also the variance, it
is preferable to calculate the derivatives of $F(s)$. This is best done
by implicit differentiation after clearing (7.6) of the denominator.
An easy calculation then shows that the mean and variance of the
recurrence times of runs of length $r$ are

(7.7)   $\mu = \dfrac{1 - p^r}{q p^r}, \quad \sigma^2 = \dfrac{1}{(q p^r)^2} - \dfrac{2r + 1}{q p^r} - \dfrac{p}{q^2}$,

respectively. Theorem 1 of the last section implies that for large $n$
the number $N_n$ of runs of length $r$ produced in $n$ trials is
approximately normally distributed, that is, for fixed $\alpha < \beta$ the
probability that

(7.8)   $\dfrac{n}{\mu} + \alpha \sigma n^{1/2} \mu^{-3/2} < N_n < \dfrac{n}{\mu} + \beta \sigma n^{1/2} \mu^{-3/2}$

tends to $\Phi(\beta) - \Phi(\alpha)$. This fact was first proved by von Mises,
but without the theory of recurrent events the proof requires rather
lengthy calculations. Table 2 gives a few typical means of recurrence
times.

TABLE 2

MEAN RECURRENCE TIMES FOR SUCCESS RUNS IF TRIALS ARE
PERFORMED AT THE RATE OF ONE PER SECOND

Length of Run |   p = 0.6    | p = 0.5 (Coins) |    p = 1/6 (Dice)
r =  5        | 30.7 seconds | 1 minute        | 2.6 hours
    10        | 6.9 minutes  | 34.1 minutes    | 28.0 months
    15        | 1.5 hours    | 18.2 hours      | 18,098 years
    20        | 19 hours     | 24.3 days       | 140.7 million years

The method of partial fractions of chapter XI, section 4, permits us
to derive excellent approximations. The second representation in (7.6)
shows clearly that the denominator has a unique positive root $s = x$.
For every real or imaginary number with $|s| \le x$ we have

(7.9)   $|qs(1 + ps + \dots + p^{r-1} s^{r-1})| \le qx(1 + px + \dots + p^{r-1} x^{r-1}) = 1$

where the equality sign is possible only if all terms on the left have
the same argument, that is, if $s = x$. Hence $x$ is smaller in absolute
value than any other root of the denominator in (7.6). We can,
therefore, apply formulas (4.5) and (4.9) of chapter XI with $s_1 = x$.
The coefficient $\rho_1$ is easily computed with $U(s) = p^r s^r(1 - ps)$ and
$V(s) = 1 - s + q p^r s^{r+1}$. We find, using that $V(x) = 0$,

(7.10)  $f_n \sim \dfrac{(x - 1)(1 - px)}{(r + 1 - rx)q} \cdot \dfrac{1}{x^{n+1}}$.

The probability of no run in $n$ trials is
$q_n = f_{n+1} + f_{n+2} + f_{n+3} + \dots$. Equation (7.10) approximates
$q_n$ by a geometric series, and we get

(7.11)  $q_n \sim \dfrac{1 - px}{(r + 1 - rx)q} \cdot \dfrac{1}{x^{n+1}}$.

We have thus found that the probability of no success run of length
$r$ in $n$ trials is, asymptotically, given by (7.11). Table 3 shows that
the formula gives surprisingly good approximations even for very small
$n$, and the approximation improves rapidly with $n$. This illustrates
the power of the method of generating functions and partial fractions.

TABLE 3

PROBABILITY OF HAVING NO SUCCESS RUN OF LENGTH r = 2 IN n
TRIALS WITH p = 1/2

 n | q_n Exact | Approximation (7.11) | Error
 2 | 0.75      | 0.76631              | 0.0163
 3 |  .625     |  .61996              |  .0050
 4 |  .500     |  .50156              |  .0016
 5 |  .40625   |  .40577              |  .0005

Numerical Calculations. For the benefit of the practical-minded
reader we use this occasion to show that the numerical calculations
involved in partial fraction expansions are often less formidable than
they appear at first sight, and that excellent estimates of the error
can be obtained.

The asymptotic expansion (7.11) raises two questions: first, the
contribution of the $r - 1$ neglected roots must be estimated, and
second, the dominant root $x$ must be evaluated.

The first representation in (7.6) shows that all roots of the
denominator of $F(s)$ satisfy the equation

(7.12)  $s = 1 + q p^r s^{r+1}$,

although (7.12) has the additional extraneous root $s = p^{-1}$. For
positive $s$ the graph of $f(s) = 1 + q p^r s^{r+1}$ is convex; it
intersects the bisector $y = s$ at $x$ and $p^{-1}$, and in the interval
between $x$ and $p^{-1}$ the graph lies below the bisector. Furthermore,
$f'(p^{-1}) = (r + 1)q$. If this quantity exceeds unity, the graph of
$f(s)$ crosses the bisector at $s = p^{-1}$ from below, and hence
$p^{-1} > x$. To fix ideas we shall assume that

(7.13)  $(r + 1)q > 1$;

in this case $x < p^{-1}$, and $f(s) < s$ for $x < s < p^{-1}$. It follows
that for all complex numbers $s$ such that $x < |s| < p^{-1}$ we have
$|f(s)| \le f(|s|) < |s|$ so that no root $s_k$ can lie in the annulus
$x < |s| < p^{-1}$. Since $x$ was chosen as the root smallest in absolute
value, this implies that

(7.14)  $|s_k| > p^{-1}$

for each root $s_k \ne x$. By differentiation of (7.12) it is now seen
that all roots are simple.

The contribution of each root to $q_n$ is of the same form as the
contribution (7.11) of the dominant root $x$, and therefore the $r - 1$
terms neglected in (7.11) are of the form

(7.15)  $A_k = \dfrac{p s_k - 1}{r s_k - (r + 1)} \cdot \dfrac{1}{q s_k^{n+1}}$.

We require an upper bound for the first fraction on the right. For
that purpose note that for fixed $s > p^{-1} > (r + 1)r^{-1}$

(7.16)  $\left| \dfrac{p s e^{i\theta} - 1}{r s e^{i\theta} - (r + 1)} \right| \le \dfrac{ps + 1}{rs + r + 1}$;

in fact, the quantity on the left obviously assumes its maximum and
minimum for $\theta = 0$ and $\theta = \pi$, and a direct substitution shows
that 0 corresponds to a minimum, $\pi$ to a maximum. In view of (7.13)
and (7.14) we have then

(7.17)  $|A_k| < \dfrac{2p^{n+1}}{(r + 1 + rp^{-1})q} < \dfrac{2p^{n+2}}{rq(1 + p)}$.

We conclude that in (7.11) the error committed by neglecting the $r - 1$
roots different from $x$ is less in absolute value than

(7.18)  $\dfrac{2(r - 1)p^{n+2}}{rq(1 + p)}$.

The root $x$ is easily calculated from (7.12) by successive
approximations putting $x_0 = 1$, $x_{\nu+1} = f(x_\nu)$. The sequence will
converge monotonically to $x$, and each term provides a lower bound for
$x$, whereas any value $s$ such that $s > f(s)$ provides an upper bound. It
is easily seen that

(7.19)  $x = 1 + q p^r + (r + 1)(q p^r)^2 + \dots$.
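
The comparison in table 3 can be reproduced with a short computation. The sketch below finds the dominant root by the successive approximations described for (7.12), evaluates the approximation (7.11) and the error bound (7.18), and compares them with an exact value obtained by tracking the length of the current success streak. The exact method and all function names are choices made for this illustration, not part of the text.

```python
def dominant_root(p, r, iterations=200):
    """Smallest positive root x of s = 1 + q p^r s^(r+1), found by the
    successive approximations x_0 = 1, x_{v+1} = f(x_v)."""
    q = 1.0 - p
    x = 1.0
    for _ in range(iterations):
        x = 1.0 + q * p ** r * x ** (r + 1)
    return x


def no_run_exact(p, r, n):
    """Probability of no success run of length r in n Bernoulli trials.
    state[j] = P{no run so far and the current success streak has length j}."""
    q = 1.0 - p
    state = [1.0] + [0.0] * (r - 1)
    for _ in range(n):
        new = [0.0] * r
        new[0] = q * sum(state)          # a failure resets the streak
        for j in range(r - 1):
            new[j + 1] = p * state[j]    # a success extends it (length r is dropped)
        state = new
    return sum(state)


def no_run_approx(p, r, n):
    """Right side of (7.11)."""
    q = 1.0 - p
    x = dominant_root(p, r)
    return (1.0 - p * x) / ((r + 1 - r * x) * q) / x ** (n + 1)


def error_bound(p, r, n):
    """Bound (7.18), valid under assumption (7.13): (r + 1) q > 1."""
    q = 1.0 - p
    return 2.0 * (r - 1) * p ** (n + 2) / (r * q * (1.0 + p))


p, r = 0.5, 2
for n in range(2, 6):
    exact, approx = no_run_exact(p, r, n), no_run_approx(p, r, n)
    print(n, exact, round(approx, 5), round(abs(exact - approx), 4),
          round(error_bound(p, r, n), 4))
```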
*8. MORE GENERAL PATTERNS 


Our method is applicable to more general problems which have been
considered as considerably deeper than the theory of runs.


Examples. (a) Runs of either kind. Let $\mathcal{E}$ stand for "either a
success run of length $r$ or a failure run of length $\rho$." We are dealing
with two recurrent events $\mathcal{E}_1$ and $\mathcal{E}_2$, where
$\mathcal{E}_1$ stands for "success run of length $r$" and $\mathcal{E}_2$ for
"failure run of length $\rho$" and $\mathcal{E}$ means "either $\mathcal{E}_1$
or $\mathcal{E}_2$." To $\mathcal{E}_1$ there corresponds the generating
function (7.5) which will now be denoted by $U_1(s)$. The corresponding
generating function $U_2(s)$ for $\mathcal{E}_2$ is obtained from (7.5) by
interchanging $p$ and $q$ and replacing $r$ by $\rho$. The probability $u_n$
that $\mathcal{E}$ occurs at the nth trial is the sum of the corresponding
probabilities for $\mathcal{E}_1$ and $\mathcal{E}_2$, except that $u_0 = 1$.
It follows that

(8.1)   $U(s) = U_1(s) + U_2(s) - 1$.

The generating function $F(s)$ of the recurrence times of $\mathcal{E}$ is
again $F(s) = 1 - U^{-1}(s)$ or

(8.2)   $F(s) = \dfrac{(1 - ps)p^r s^r(1 - q^\rho s^\rho) + (1 - qs)q^\rho s^\rho(1 - p^r s^r)}{1 - s + q p^r s^{r+1} + p q^\rho s^{\rho+1} - p^r q^\rho s^{r+\rho}}$.

The mean recurrence time follows by differentiation

(8.3)   $\mu = \dfrac{(1 - p^r)(1 - q^\rho)}{q p^r + p q^\rho - p^r q^\rho}$.

As $\rho \to \infty$, this expression tends to the mean recurrence time of
success runs as given in (7.7).

(b) In chapter VIII, section 1, we calculated the probability $x$ that
a success run of length $r$ occurs before a failure run of length $\rho$.
Define two recurrent events $\mathcal{E}_1$ and $\mathcal{E}_2$ as in example
(a). Let $x_n$ = probability that $\mathcal{E}_1$ occurs for the first time
at the nth trial and no $\mathcal{E}_2$ precedes it; $f_n$ = probability
that $\mathcal{E}_1$ occurs for the first time at the nth trial (with no
condition on $\mathcal{E}_2$). Define $y_n$ and $g_n$ as $x_n$ and $f_n$,
respectively, but with $\mathcal{E}_1$ and $\mathcal{E}_2$ interchanged.

The generating function for $f_n$ is given in (7.6), and $G(s)$ is
obtained by interchanging $p$ and $q$ and replacing $r$ by $\rho$. For $x_n$ and
$y_n$ we have the obvious recurrence relations

(8.4)   $x_n = f_n - (y_1 f_{n-1} + y_2 f_{n-2} + \dots + y_{n-1} f_1)$,
        $y_n = g_n - (x_1 g_{n-1} + x_2 g_{n-2} + \dots + x_{n-1} g_1)$.

These equations are of the convolution type, and for the corresponding
generating functions we have, therefore,

(8.5)   $X(s) = F(s) - Y(s)F(s)$,
        $Y(s) = G(s) - X(s)G(s)$.

From these two linear equations we get

(8.6)   $X(s) = \dfrac{F(s)\{1 - G(s)\}}{1 - F(s)G(s)}, \quad Y(s) = \dfrac{G(s)\{1 - F(s)\}}{1 - F(s)G(s)}$.

Expressions for $x_n$ and $y_n$ can again be obtained by the method of
partial fractions. For $s = 1$ we get $X(1) = \sum x_n = x$, the probability
of $\mathcal{E}_1$ occurring before $\mathcal{E}_2$. Both numerator and
denominator vanish, and $X(1)$ is obtained from L'Hospital's rule
differentiating numerator and denominator:
$X(1) = G'(1)/\{F'(1) + G'(1)\}$. Using the values
$F'(1) = (1 - p^r)/q p^r$ and $G'(1) = (1 - q^\rho)/p q^\rho$ from (7.7), we find
$X(1)$ as given in equation VIII(1.3).

(c) Consider the recurrent event defined by the pattern SSFFSS.
Repeating the argument of section 7, we easily find that

(8.7)   $p^4 q^2 = u_n + p^2 q^2 u_{n-4} + p^3 q^2 u_{n-5}$.

Since we know that $u_n \to \mu^{-1}$ we get for the mean recurrence time
$\mu = p^{-4} q^{-2} + p^{-2} + p^{-1}$. For $p = q = \tfrac{1}{2}$ we find
$\mu = 70$, whereas the mean recurrence time for a success run of length
6 is 126. This shows that, contrary to expectation, there is an
essential difference in coin tossing between head runs and other
patterns of the same length.
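
The two numbers 70 and 126 can be recomputed by one rule. The sketch below assumes that the argument leading to (8.7) carries over to an arbitrary pattern of S and F: each length $k$ for which the first $k$ letters of the pattern coincide with its last $k$ letters contributes the reciprocal of the probability of those $k$ letters. This generalization and the function name are assumptions made for illustration; the text itself treats only the patterns named above.

```python
def mean_recurrence_time(pattern, p):
    """Mean recurrence time of a pattern of 'S' and 'F' in Bernoulli trials
    with success probability p, summing 1 / P(prefix of length k) over all
    k for which that prefix equals the suffix of the same length."""
    prob = {"S": p, "F": 1.0 - p}
    total = 0.0
    for k in range(1, len(pattern) + 1):
        if pattern[:k] == pattern[-k:]:
            block = 1.0
            for letter in pattern[:k]:
                block *= prob[letter]
            total += 1.0 / block
    return total


print(mean_recurrence_time("SSFFSS", 0.5))   # pattern of example (c)
print(mean_recurrence_time("SSSSSS", 0.5))   # success run of length 6
```

For a pure success run of length $r$ the rule reduces to $p^{-1} + p^{-2} + \dots + p^{-r} = (1 - p^r)/(q p^r)$, in agreement with (7.7).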


9. LACK OF MEMORY OF GEOMETRIC WAITING TIMES 


The geometric distribution for waiting times has an interesting and
important property not shared by any other distribution. Consider a
sequence of Bernoulli trials and let $T$ be the number of trials up to
and including the first success. Then $P\{T > k\} = q^k$. Suppose we know
that no success has occurred during the first $m$ trials; the waiting
time $T$ from this mth failure to the first success has exactly the same
distribution $\{q^k\}$ and is independent of the number of preceding
failures. In other words, the probability that the waiting time will be
prolonged by $k$ always equals the initial probability of the total
length exceeding $k$. If the life span of an atom or a piece of equipment
has a geometric distribution, then no aging takes place; as long as it
lives, the atom has the same probability of decaying at the next trial.
Radioactive atoms actually have this property (except that in the case
of a continuous time the exponential distribution plays the role of the
geometric distribution). Conversely, if it is known that a phenomenon
is characterized by a complete lack of memory or aging, then the
probability distribution of the duration must be geometric or
exponential. Typical is a well-known type of telephone conversation
often cited as the model of incoherence and depending entirely on
momentary impulses; a possible termination is an instantaneous chance
effect without relation to the past chatter. By contrast, the knowledge
that no streetcar has passed for five minutes increases our expectation
that it will come soon. In coin tossing, the probability that the
cumulative numbers of heads and tails will equalize at the second trial
is $\tfrac{1}{2}$. However, given that they did not, the probability that
they equalize after two additional trials is only $\tfrac{1}{4}$. These
are examples for aftereffect.

For a rigorous formulation of the assertion, suppose that a waiting
time $T$ assumes the values $0, 1, 2, \dots$ with probabilities $p_0$, $p_1$,
$p_2$, $\dots$. Let the distribution of $T$ have the following property: The
conditional probability that the waiting time terminates at the kth
trial, assuming that it has not terminated before, equals $p_0$ (the
probability at the first trial). We claim that $p_k = (1 - p_0)^k p_0$, so
that $T$ has a geometric distribution.

For a proof we introduce again the "tails"

$q_k = p_{k+1} + p_{k+2} + p_{k+3} + \dots = P\{T > k\}$.

Our hypothesis is $T > k - 1$, and its probability is $q_{k-1}$. The
conditional probability of $T = k$ is therefore $p_k/q_{k-1}$, and the
assumption is that for all $k \ge 1$

(9.1)   $\dfrac{p_k}{q_{k-1}} = p_0$.

Now $p_k = q_{k-1} - q_k$, and hence

$\dfrac{q_k}{q_{k-1}} = 1 - p_0$.

Since $q_0 = p_1 + p_2 + \dots = 1 - p_0$, it follows that
$q_k = (1 - p_0)^{k+1}$, and hence $p_k = q_{k-1} - q_k = (1 - p_0)^k p_0$, as
asserted.
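
The ratio in (9.1) is easy to tabulate. The sketch below computes $p_k/q_{k-1}$ for a geometric distribution and, for contrast, for a uniform distribution on four values, where the ratio grows with $k$ (aging). The helper names and the sample distributions are chosen only for illustration.

```python
def geometric(p0, n_terms):
    """First n_terms probabilities p_k = (1 - p0)^k p0, k = 0, 1, 2, ..."""
    return [(1.0 - p0) ** k * p0 for k in range(n_terms)]


def termination_ratio(p, k):
    """Conditional probability p_k / q_{k-1} that the waiting time ends at k,
    given that it has not ended before; q_{k-1} = P{T > k-1} = 1 - (p_0 + ... + p_{k-1})."""
    tail = 1.0 - sum(p[:k])
    return p[k] / tail


geo = geometric(0.3, 30)
print([round(termination_ratio(geo, k), 6) for k in range(1, 6)])

uniform = [0.25, 0.25, 0.25, 0.25]
print([round(termination_ratio(uniform, k), 6) for k in range(1, 4)])
```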

In the theory of stochastic processes the described lack of memory
is connected with the Markovian property; we shall return to it in
chapter XV, section 10.


*10. PROOF OF THEOREM 3 OF SECTION 3


In section 3 we have omitted the proof of theorem 3. The latter
can be formulated either as a "Tauberian" theorem on power series
or in an elementary way as follows. Given a sequence $\{f_n\}$ such that
$f_0 = 0$, $f_n \ge 0$, $\sum f_n = 1$, and that the greatest common divisor of
those $n$ for which $f_n > 0$ is one. Let $u_0 = 1$ and define $u_n$ for
$n \ge 1$ by

(10.1)  $u_n = f_1 u_{n-1} + f_2 u_{n-2} + \dots + f_n u_0$.

Then $u_n \to 1/\mu$, where $\mu = \sum n f_n$ (and $u_n \to 0$ if $\sum n f_n$
diverges).

For the proof put

(10.2)  $r_n = f_{n+1} + f_{n+2} + \dots$

so that by formula XI(1.8)

(10.3)  $\mu = \sum r_n$.

From (10.2) we get $r_0 = 1$, $f_1 = r_0 - r_1$, $f_2 = r_1 - r_2$, etc.
Substituting these values into (10.1), we find that
$r_0 u_n + r_1 u_{n-1} + \dots + r_n u_0 = r_0 u_{n-1} + r_1 u_{n-2} + \dots + r_{n-1} u_0$.
If the left side is called $A_n$, then the right side is $A_{n-1}$, and our
equation states that all $A_n$ are equal. Now $A_0 = r_0 u_0 = 1$, and hence
$A_n = 1$ for all $n$. Thus we have for every $n$

(10.4)  $r_0 u_n + r_1 u_{n-1} + \dots + r_n u_0 = 1$.
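
Both the identity (10.4) and the assertion $u_n \to 1/\mu$ can be observed numerically. The sketch below uses a hypothetical distribution with $f_1 = 0$ but with greatest common divisor one, so that it also covers the case treated last in the proof; the function names are illustrative.

```python
def renewal_sequence(f, n_terms):
    """u_0 = 1 and u_n = f_1 u_{n-1} + ... + f_n u_0, as in (10.1).
    f[0] must be 0; terms of f beyond the list are taken to be zero."""
    u = [1.0]
    for n in range(1, n_terms):
        u.append(sum(f[k] * u[n - k] for k in range(1, min(n, len(f) - 1) + 1)))
    return u


# Hypothetical example: f_2 = f_3 = 1/2, so f_1 = 0 and gcd{2, 3} = 1.
f = [0.0, 0.0, 0.5, 0.5]
r = [sum(f[k + 1:]) for k in range(len(f))]   # tails r_n of (10.2)
mu = sum(r)                                   # (10.3)
u = renewal_sequence(f, 200)


def A(n):
    """Left side of (10.4)."""
    return sum(r[k] * u[n - k] for k in range(min(n, len(r) - 1) + 1))


print(mu, u[-1], 1.0 / mu)
print([A(n) for n in range(6)])
```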


From (10.1) it follows by induction that $u_n \le 1$. Hence there exists
a number $\lambda = \limsup u_n$ such that for any $\epsilon > 0$ and all
sufficiently large $n$ we have $u_n < \lambda + \epsilon$, and there exists
some sequence $n_1, n_2, n_3, \dots$ such that $u_{n_\nu} \to \lambda$. Choose
an integer $j > 0$ such that $f_j > 0$. We claim that
$u_{n_\nu - j} \to \lambda$. If this were not so, we could find arbitrarily
large subscripts $n$ such that simultaneously

(10.5)  $u_n > \lambda - \epsilon, \quad u_{n-j} < \lambda' < \lambda$.

Now let $N$ be so large that $r_N < \epsilon$. Since $u_k \le 1$, we have then
from (10.1) for $n > N$

(10.6)  $u_n \le f_0 u_n + f_1 u_{n-1} + \dots + f_N u_{n-N} + \epsilon$.

For sufficiently large $n$ each $u_k$ on the right side is less than
$\lambda + \epsilon$, and $u_{n-j} < \lambda'$. Hence

(10.7)  $u_n < (f_0 + f_1 + \dots + f_{j-1} + f_{j+1} + \dots + f_N)(\lambda + \epsilon) + f_j \lambda' + \epsilon$
        $\le (1 - f_j)(\lambda + \epsilon) + f_j \lambda' + \epsilon < \lambda + 2\epsilon - f_j(\lambda - \lambda')$.

If we choose $\epsilon$ so small that $f_j(\lambda - \lambda') > 3\epsilon$, then
the last inequality contradicts the first one in (10.5), so that the
assumption $\lambda' < \lambda$ is impossible.

* This section should be omitted at first reading.

This proves that, whenever $u_{n_\nu} \to \lambda$, also
$u_{n_\nu - j} \to \lambda$. Repeating the argument, we see: If $f_j > 0$ and
$u_{n_\nu} \to \lambda = \limsup u_n$, then also

$u_{n_\nu - j} \to \lambda, \quad u_{n_\nu - 2j} \to \lambda, \quad u_{n_\nu - 3j} \to \lambda$, etc.

For simplicity let us first consider the case where $f_1 > 0$. Then we
can take $j = 1$ and conclude that $u_{n_\nu - k} \to \lambda$ for every fixed
$k$. From (10.4) we find for $n = n_\nu$

(10.8)  $1 \ge r_0 u_{n_\nu} + r_1 u_{n_\nu - 1} + \dots + r_N u_{n_\nu - N}$.

For fixed $N$ every $u_{n_\nu - k} \to \lambda$, so that
$1 \ge \lambda(r_0 + r_1 + \dots + r_N)$. Since $N$ is arbitrary, we conclude
that $1 \ge \lambda\mu$ or $\lambda \le 1/\mu$. This completes the proof for the
case where (10.3) diverges, for then $u_n \to 0$.

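If $\mu < \infty$, let $\gamma = \liminf u_n$. The same argument shows that,
for every sequence $n_\nu$ for which $u_{n_\nu} \to \gamma$, also
$u_{n_\nu - k} \to \gamma$. If $N$ is large enough that
$r_{N+1} + r_{N+2} + \dots < \epsilon$ (which is possible since
$\mu = \sum r_n$ converges), then from (10.4) and $u_k \le 1$

(10.9)  $1 \le r_0 u_{n_\nu} + \dots + r_N u_{n_\nu - N} + \epsilon$;

herein $u_{n_\nu - k} \to \gamma$ so that
$1 \le (r_0 + \dots + r_N)\gamma + \epsilon$ and hence $\mu\gamma \ge 1$. However,
by definition, $\gamma \le \lambda$. Therefore $\gamma = \lambda = 1/\mu$, as was
to be proved.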

There remains the case where $f_1 = 0$. Consider then the collection
of all integers j for which $f_j > 0$. Among them we can find a finite
collection a, b, c, ..., m whose greatest common divisor is 1. We know
that, when $u_{n_\nu} \to \lambda$, also $u_{n_\nu - xa} \to \lambda$, $u_{n_\nu - yb} \to \lambda$, etc., for every fixed
$x > 0, y > 0, \ldots, w > 0$; hence also $u_{n_\nu - xa - yb - \cdots - wm} \to \lambda$. In other
words, if an integer k is of the form $k = xa + yb + \cdots + wm$ with
positive integers x, y, ..., w, then $u_{n_\nu - k} \to \lambda$. Now it is known from
elementary number theory that every integer k exceeding the product
abc ... m can be written in this form. This means that for $k > abc \cdots m$
we have $u_{n_\nu - k} \to \lambda$. To get the inequality (10.8) it suffices to apply
(10.4) to $n = n_\nu + ab \cdots m$. The remaining part of the proof requires
no change.
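The statement just proved can be illustrated numerically. The following is a minimal sketch, not part of the proof: it assumes that (10.1) is the recursion $u_n = f_1 u_{n-1} + \cdots + f_n u_0$ with $u_0 = 1$ and that $r_k = f_{k+1} + f_{k+2} + \cdots$. The function name `renewal_sequence` and the particular distribution ($f_2 = f_3 = \tfrac12$, so that $f_1 = 0$ and the greatest common divisor of the possible recurrence times is 1) are chosen here for illustration only.

```python
def renewal_sequence(f, n_max):
    """Return [u_0, ..., u_n_max] with u_0 = 1 and
    u_n = f_1 u_{n-1} + f_2 u_{n-2} + ... + f_n u_0.

    f[k] is the probability that the recurrence time equals k (f[0] is unused).
    """
    u = [1.0]
    for n in range(1, n_max + 1):
        top = min(n, len(f) - 1)
        u.append(sum(f[k] * u[n - k] for k in range(1, top + 1)))
    return u


# Illustrative recurrence-time distribution: f_2 = f_3 = 1/2, hence f_1 = 0.
f = [0.0, 0.0, 0.5, 0.5]
mu = sum(k * fk for k, fk in enumerate(f))       # mean recurrence time
r = [sum(f[k + 1:]) for k in range(len(f))]      # r_k = f_{k+1} + f_{k+2} + ...

n = 200
u = renewal_sequence(f, n)
identity = sum(r[k] * u[n - k] for k in range(len(r)))   # left side of (10.4)

print(mu, 1 / mu)       # mean recurrence time and the predicted limit 1/mu
print(u[n], identity)   # u_n for large n, and the value of the sum in (10.4)
```

For this distribution the sum of the $r_k$ equals $\mu$, and $u_n$ settles near $1/\mu$ after comparatively few steps.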
11. PROBLEMS FOR SOLUTION


1. Suppose that F(s) is a polynomial. Prove for this case all theorems of 
section 3, using the partial fraction method of chapter XI, section 4. 

2. Let r coins be tossed repeatedly and let ℰ be the recurrent event that for
each of the r coins the accumulated number of heads and tails are equal. Is
ℰ persistent or transient? For the smallest r for which ℰ is transient, estimate
the probability that ℰ ever occurs.

3. Let $\{X_k\}$ be a sequence of mutually independent random variables with
the common distribution $P\{X_k = a\} = b/(a + b)$, $P\{X_k = -b\} = a/(a + b)$,
where a and b are positive integers. Let ℰ denote the event $S_n = 0$. Prove
that ℰ is a persistent event.


4. Let $\{X_k\}$ be an arbitrary sequence of mutually independent random
variables with a common distribution, and let ℰ stand for $S_n = 0$, $S_1 \le 0$,
$S_2 \le 0$, ..., $S_{n-1} \le 0$. Prove that ℰ is a transient recurrent event except
in the trivial case where $P\{X_k = 0\} = 1$.

5. Repeated averaging. Modify the example of section 4 so as to permit 
arbitrary weighted averages and find the limit. 


Note: Problems 6-8 refer to Bernoulli trials with $p = q = \tfrac12$ (coin tossing). The
generating function $F(s) = 1 - (1 - s^2)^{1/2}$ for the return to zero is assumed to be known.

6. Let ℰ be the recurrent event $S_n = 0$, $S_{n-1} < 0$. Find the generating
function $F_1(s)$ of the recurrence times.

7. Continuation. Find the generating function of the recurrence time for
ladder points (example 1.d). (Note that this is the same as the waiting time for
the first passage through 1 discussed in chapter XI, section 3.)

8. Continuation. Prove the theorem: The probability that at the 2nth or
(2n+1)st trial $S_n$ assumes a value not previously assumed (i.e., a first passage
occurs) equals the probability that $S_{2n} = 0$.


9. In the counter problem (1.6): (a) Find the generating function of the
recurrence time. (What is its physical significance?) (b) If $Z_n$ is the number of
registrations in the first n trials, find $E(Z_n)$ and $\mathrm{Var}(Z_n)$.

10. Counters of Type II differ from those in example (1.5) in that each success 
locks the counter for r time units (r — 1 trials following the success) so that 
a success during a locked period prolongs that period. Do problem 9 for such 
counters. 

11. Find an approximation to the probability that in 10,000 tossings of a 
coin the number of head runs of length 3 will lie between 700 and 730. 

12. In a sequence of tossings of a coin let ℰ stand for the pattern HTH.
Let $r_n$ be the probability that ℰ does not occur in n trials. Find the generating
function and use the partial fraction method to obtain an asymptotic
expansion.

13. In example (8.b) show that the expected duration of the game is


$\mu_1 \mu_2 / (\mu_1 + \mu_2),$


where $\mu_1$ and $\mu_2$ are the mean recurrence times for success runs of length r and
failure runs of length $\rho$, respectively.

14. The possible outcomes of each trial are A, B, and C; the corresponding
probabilities are $\alpha, \beta, \gamma$ ($\alpha + \beta + \gamma = 1$). Find the generating function of
the probability that in n trials there is no run of length r: (a) of A’s, (b) of
A’s or B’s, (c) of any kind.

15. Continuation. Find the probability that the first A-run of length r precedes
the first B-run of length $\rho$ and terminates at the nth trial. [Note that
this problem does not reduce to that of example (8.b) with $p = \alpha/(\alpha + \beta)$,
$q = \beta/(\alpha + \beta)$.]


Note: The following problems refer to the renewal theory, specifically to example (5.0). 


16. Constancy of the population. For the quantities (5.9) prove by induction
that $\sum_k v_k(n) = N$ for every n.


17. If the mortality distribution is given by $p_k = q^{k-1} p$ (with $p + q = 1$),
find $u_n$ and the limiting age distribution, assuming that the original population
consists of N elements aged zero.

18. An age distribution is called stationary if $v_k(n)$ does not depend on n.
Show that this is the case if, and only if, $v_k = C r_k$, where C is a constant.


19. Let ℰ be a persistent aperiodic recurrent event. Assume that the recurrence
time has finite mean $\mu$ and variance $\sigma^2$. Put $q_n = f_{n+1} + f_{n+2} + \cdots$
and $r_n = q_{n+1} + q_{n+2} + \cdots$. Show that the generating functions Q(s) and
R(s) converge for s = 1. Prove that


(11.1)    $\sum_{n=0}^{\infty} \left(u_n - \frac{1}{\mu}\right) s^n = \frac{R(s)}{\mu\,Q(s)}$


and hence that


(11.2)    $\sum_{n=0}^{\infty} \left(u_n - \frac{1}{\mu}\right) = \frac{\sigma^2 - \mu + \mu^2}{2\mu^2}.$


20. Let ℰ be a persistent recurrent event and $N_r$ the number of occurrences
of ℰ in r trials. Prove that $E(N_r) = u_1 + \cdots + u_r$ and hence


(11.3)    $E(N_r) \sim \frac{r}{\mu}.$


21. Continuation. Prove that


$E(N_r^2) = u_1 + \cdots + u_r + 2 \sum_{j=1}^{r-1} u_j (u_1 + \cdots + u_{r-j})$


and hence that $E(N_r^2)$ is the coefficient of $s^r$ in


(11.4)    $\frac{F^2(s) + F(s)}{(1 - s)\{1 - F(s)\}^2}.$


(Note that this may be reformulated more elegantly using bivariate generating
functions.)

22. Let $q_{k,n} = P\{N_k = n\}$. Show that $q_{k,n}$ is the coefficient of $s^k$ in


(11.5)    $F^n(s)\,\frac{1 - F(s)}{1 - s}.$


Deduce that $E(N_r)$ and $E(N_r^2)$ are the coefficients of $s^r$ in


(11.6)    $\frac{F(s)}{(1 - s)\{1 - F(s)\}}$


and (11.4), respectively.


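23. Using the notations of problem 19, show that


(11.7)    $\frac{F(s)}{(1 - s)\{1 - F(s)\}} + \frac{1}{1 - s} = \frac{1}{\mu(1 - s)^2} + \frac{R(s)}{\mu(1 - s)\,Q(s)}.$


Hence, using the last problem, conclude that


(11.8)    $E(N_r) = \frac{r + 1}{\mu} + \frac{\sigma^2 - \mu - \mu^2}{2\mu^2} + \varepsilon_r$


with $\varepsilon_r \to 0$.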
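A numerical check can make the asymptotic statement of problem 23 concrete. The sketch below is only an illustration under stated assumptions: it takes the expansion in the form $E(N_r) = (r+1)/\mu + (\sigma^2 - \mu - \mu^2)/(2\mu^2) + \varepsilon_r$, uses $E(N_r) = u_1 + \cdots + u_r$ from problem 20, and works with a hypothetical recurrence-time distribution $f_2 = f_3 = \tfrac12$. The helper names are invented for this example.

```python
def renewal_sequence(f, n_max):
    """u_0 = 1 and u_n = f_1 u_{n-1} + ... + f_n u_0 (f[0] is unused)."""
    u = [1.0]
    for n in range(1, n_max + 1):
        top = min(n, len(f) - 1)
        u.append(sum(f[k] * u[n - k] for k in range(1, top + 1)))
    return u


f = [0.0, 0.0, 0.5, 0.5]                                   # illustrative choice
mu = sum(k * fk for k, fk in enumerate(f))                 # mean
var = sum(k * k * fk for k, fk in enumerate(f)) - mu ** 2  # variance

u = renewal_sequence(f, 400)


def expected_occurrences(r):
    """E(N_r) = u_1 + ... + u_r."""
    return sum(u[1:r + 1])


def asymptotic_value(r):
    """(r + 1)/mu + (sigma^2 - mu - mu^2)/(2 mu^2), without the error term."""
    return (r + 1) / mu + (var - mu - mu ** 2) / (2 * mu ** 2)


for r in (8, 50, 400):
    print(r, expected_occurrences(r), asymptotic_value(r))
```

The difference between the two columns is the error term $\varepsilon_r$, which for this distribution shrinks quickly as r grows.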

24. Continuation. Using a similar argument, show that


(11.9)    $E(N_r^2) = \frac{(r + 2)(r + 1)}{\mu^2} + \frac{2\sigma^2 - 2\mu - \mu^2}{\mu^3}\,r + \alpha_r,$


where $\alpha_r$ remains bounded. Hence


(11.10)    $\mathrm{Var}(N_r) \sim \frac{\sigma^2}{\mu^3}\,r.$


25. In a sequence of Bernoulli trials let $q_{k,n}$ be the probability that exactly n
success runs of length r occur in k trials. Using problem 22, show that the
generating function $Q_k(x) = \sum_n q_{k,n} x^n$ is the coefficient of $s^k$ in


$\frac{1 - p^r s^r}{1 - s + q p^r s^{r+1} - (1 - ps)\,p^r s^r x}.$


Show, furthermore, that the root of the denominator which is smallest in
absolute value is $s_1 \approx 1 + q p^r (1 - x)$.
26. Continuation. The Poisson distribution of long runs. If the number
k of trials and the length r of runs both tend to infinity, so that $k q p^r \to \lambda$,
then the probability of having exactly n runs of length r tends to $e^{-\lambda} \lambda^n / n!$.
Hint: Using the preceding problem, show that the generating function is
asymptotically $\{1 + q p^r (1 - x)\}^{-k} \sim e^{-\lambda(1 - x)}$. Use the continuity theorem
of chapter XI, section 6.


6 The theorem was proved by von Mises, but the present method is considerably 
simpler. 


CHAPTER XIV 


Random Walk and Ruin Problems 


1. GENERAL ORIENTATION 


The first part of this chapter is devoted to Bernoulli trials, and once 
more the picturesque language of betting and random walks is used to 
simplify and enliven the formulations. 

Consider the familiar gambler who wins or loses a dollar with probabilities
p and q, respectively. Let his initial capital be z and let him
play against an adversary with initial capital a − z, so that the combined
capital is a. The game continues until the gambler’s capital
either is reduced to zero or has increased to a, that is, until one of the
two players is ruined. We are interested in the probability of the
gambler’s ruin and the probability distribution of the duration of the
game. This is the classical ruin problem.
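Before any formulas are derived, the game can be explored by simulation. The following is a rough sketch only: the function name `play_until_absorbed`, the seed, the number of games, and the choice z = 3, a = 10, p = 1/2 are all assumptions made for illustration. For comparison, formulas (2.5) and (3.5) of the following sections give the exact values 1 − z/a = 0.7 for the probability of ruin and z(a − z) = 21 for the expected duration in this symmetric case.

```python
import random


def play_until_absorbed(z, a, p, rng):
    """Play one game from capital z; return (ruined, number_of_trials)."""
    capital, trials = z, 0
    while 0 < capital < a:
        capital += 1 if rng.random() < p else -1   # win or lose one dollar
        trials += 1
    return capital == 0, trials


rng = random.Random(2024)          # arbitrary fixed seed for reproducibility
games = [play_until_absorbed(3, 10, 0.5, rng) for _ in range(20000)]

ruin_frequency = sum(ruined for ruined, _ in games) / len(games)
mean_duration = sum(trials for _, trials in games) / len(games)
print(ruin_frequency, mean_duration)
```

The observed frequencies fluctuate from run to run (and with the seed), so they should only be expected to lie near the exact values.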

Physical applications and analogies suggest the more flexible interpretation
in terms of the motion of a variable point or ‘particle’ on
the x-axis. At time 0 this particle is at its initial position z, and at
times 1, 2, 3, ... it moves a unit step in the positive or negative direc- 
tion, depending on whether the corresponding trial resulted in success 
or failure. The position of the particle at time n represents the gam- 
bler’s capital at the conclusion of the nth trial. The trials terminate 
when the particle for the first time reaches either 0 or a, and we describe 
this by saying that the particle performs a random walk with absorbing 
barriers at 0 and a. This random walk is restricted to the possible posi- 
tions 1, 2, ..., a—1; in the absence of absorbing barriers the random 
walk is called unrestricted. Physicists use the random-walk model as 
a crude approximation to one-dimensional diffusion or Brownian motion,
where a physical particle is exposed to a great number of molecular
collisions which impart to it a random motion. The case p > q
corresponds to a drift to the right when shocks from the left are more
probable; when $p = q = \tfrac12$, the random walk is called symmetric.

In the limiting case $a \to \infty$ we get a random walk on a semi-infinite
line: A particle starting at z > 0 performs a random walk up to the
moment when it for the first time reaches the origin. In this formulation
we recognize the first-passage time problem; it was solved by elementary
methods in chapter III (at least for the symmetric case) and
by the use of generating functions in chapter XI, section 3 (see also
problem XIII, 7). We shall recognize formulas previously obtained,
but the present derivation is new.¹

In this chapter we shall use the method of difference equations which 
serves as an introduction to the differential equations of diffusion 
theory. This analogy leads in a natural way to various modifications 
and generalizations of the classical ruin problem, a typical and instruc- 
tive example being the replacing of absorbing barriers by reflecting and 
elastic barriers. To describe a reflecting barrier, consider a random
walk in the interval (0, a) as defined before but with the modification
that whenever the particle is at point 1 it has probability p of moving
to position 2 and probability q to stay at 1. In gambling terminology
this corresponds to a convention that whenever the gambler loses his
last dollar it is generously replaced by his adversary so that the game
can continue. The physicist imagines a wall placed at the point $\tfrac12$ of
the x-axis with the property that a particle moving from 1 toward 0
is reflected at the wall and returns to 1 instead of reaching 0. Both
the absorbing and the reflecting barriers are special cases of the so-called
elastic barrier. We define an elastic barrier at the origin by the rule that
from position 1 the particle moves with probability p to position 2; with
probability $\delta q$ it stays at 1; and with probability $(1 - \delta)q$ it moves to 0
and is absorbed (i.e., the process terminates). For $\delta = 0$ we have the
classical ruin problem or absorbing barriers, for $\delta = 1$ reflecting barriers.
As $\delta$ runs from 0 to 1 we have a family of intermediate cases.
The greater $\delta$ is, the more likely is the process to continue, and with
two reflecting barriers the process can never terminate.

Sections 2 and 3 are devoted to an elementary discussion of the 
classical ruin problem and its implications. The next three sections 
are more technical (and may be omitted); in 4 and 5 we derive the 
relevant generating functions and from them explicit expressions for 
the distribution of the duration of the game, etc. Section 6 contains 
an outline of the passage to the limit to the diffusion equation (the 
formal solutions of the latter being the limiting distributions for the 
random walk). 


1 Conversely, some of the new results can be proved also by the method of chapter
III. For the solution of the ruin problem by infinitely many reflections see
problems 7-9.

In section 7 the discussion again turns elementary and is devoted to
random walks in two or more dimensions where new phenomena are en- 
countered. Section 8 treats a generalization of an entirely different 
type, namely a random walk in one dimension where the particle is 
no longer restricted to move in unit steps but is permitted to change 
its position in jumps which are arbitrary multiples of unity. Such 
generalized random walks have attracted widespread interest in con- 
nection with Wald’s theory of sequential sampling. 

In conclusion it must be emphasized that each random walk repre- 
sents a special Markov chain, and so the present chapter serves partly 
as an introduction to the next where several random-walk problems 
(e.g., elastic barriers) will be reformulated. 

The problem section contains essential complements to the text and 
outlines of alternative approaches. It is hoped that a comparison of 
the methods used will prove highly instructive. (Readers desiring to 
refer to the graphs and the text of chapter III are asked to visualize 
the time axis horizontally and the x-axis in the vertical position.) 


2. THE CLASSICAL RUIN PROBLEM 


We shall consider the problem stated at the opening of the present
chapter. Let $q_z$ be the probability of the gambler’s ultimate² ruin
and $p_z$ the probability of his winning. In random-walk terminology
$q_z$ and $p_z$ are the probabilities that a particle starting at z will be absorbed
at 0 and a, respectively. We shall show that $p_z + q_z = 1$, so
that we need not consider the possibility of an unending game.

After the first trial the gambler’s fortune is either z − 1 or z + 1,
and therefore we must have


(2.1)    $q_z = p\,q_{z+1} + q\,q_{z-1}$


provided 1 < z < a − 1. For z = 1 the first trial may lead to ruin,
and (2.1) is to be replaced by $q_1 = p\,q_2 + q$. Similarly, for z = a − 1
the first trial may result in victory, and therefore $q_{a-1} = q\,q_{a-2}$. To
unify our equations we define


(2.2)    $q_0 = 1, \qquad q_a = 0.$


2 Strictly speaking, the probability of ruin is defined in a sample space of infi- 
nitely prolonged games, but we can work with the sample space of n trials. The 
probability of ruin in less than n trials increases with n and has therefore a limit. 
We call this limit “the probability of ruin.”’ All probabilities in this chapter may 
be interpreted in this way without reference to infinite sample spaces (cf. the 
introduction to chapter VIII).


With this convention the probability $q_z$ of ruin satisfies (2.1) for
z = 1, 2, ..., a − 1.

Equation (2.1) is a difference equation, and (2.2) represents the boundary
conditions on $q_z$. We shall derive an explicit expression for $q_z$ by
the method of particular solutions, which will also be used in more general
cases.

Suppose first that $p \ne q$. It is easily verified that the difference
equation (2.1) admits of the two particular solutions $q_z = 1$ and
$q_z = (q/p)^z$. It follows that for arbitrary constants A and B the
sequence


(2.3)    $q_z = A + B \left(\frac{q}{p}\right)^z$


represents a formal solution of (2.1). The boundary conditions (2.2)
will hold if, and only if, A and B satisfy the two linear equations
A + B = 1 and $A + B(q/p)^a = 0$. Thus


(2.4)    $q_z = \frac{(q/p)^a - (q/p)^z}{(q/p)^a - 1}$


is a formal solution of the difference equation (2.1), satisfying the 
boundary conditions (2.2). In order to prove that (2.4) is the required 
probability of ruin it remains to show that the solution is unique, that 
is, that all solutions of (2.1) are of the form (2.3). Now, given an 
arbitrary solution of (2.1), the two constants A and B can be chosen 
so that (2.3) will agree with it for z = 0 and z = 1. From these two 
values all other values can be found by substituting in (2.1) successively
z = 1, 2, 3, .... Therefore two solutions which agree for z = 0
and z = 1 are identical, and hence every solution is of the form (2.3).

Our argument breaks down if $p = q = \tfrac12$, for then (2.4) is meaningless
because in this case the two formal particular solutions $q_z = 1$ and
$q_z = (q/p)^z$ are identical. However, when $p = q = \tfrac12$ we have a second
solution in $q_z = z$, and therefore $q_z = A + Bz$ is a solution of (2.1)
depending on two constants. In order to satisfy the boundary conditions
(2.2) we must put A = 1 and A + Ba = 0. Hence


(2.5)    $q_z = 1 - \frac{z}{a}.$


(The same numerical value can be obtained formally from (2.4) by
finding the limit as $p \to \tfrac12$, using L’Hospital’s rule.)
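The limit mentioned in the parenthetical remark can be written out in one line. As a sketch (the abbreviation $r = q/p$ is introduced here only for this computation), note that $p \to \tfrac12$ means $r \to 1$, and differentiate numerator and denominator of (2.4) with respect to r:

```latex
% r = q/p tends to 1 as p tends to 1/2; apply L'Hospital's rule in r.
\lim_{r \to 1} \frac{r^{a} - r^{z}}{r^{a} - 1}
  = \lim_{r \to 1} \frac{a\,r^{a-1} - z\,r^{z-1}}{a\,r^{a-1}}
  = \frac{a - z}{a}
  = 1 - \frac{z}{a}.
```

This agrees with the value found directly from the boundary conditions in the symmetric case.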
We have thus proved that the required probability of the gambler’s
ruin is given by (2.4) if $p \ne q$, and by (2.5) if $p = q = \tfrac12$. The probability
$p_z$ of the gambler’s winning the game equals the probability of
his adversary’s ruin and is therefore obtained from our formulas on
replacing p, q, and z by q, p, and a − z, respectively. It is readily
seen that $p_z + q_z = 1$, as stated previously.
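The two formulas are easy to evaluate and to check against the difference equation. The following minimal sketch assumes nothing beyond (2.1), (2.2), (2.4), and (2.5); the function names `ruin_probability` and `win_probability` are invented for illustration, and plain floating-point powers are used, so the code is meant for moderate values of a.

```python
def ruin_probability(z, a, p):
    """Probability q_z of ruin: (2.4) when p != q, (2.5) when p = q = 1/2.

    Plain floating-point powers are used, so very large a with p != q
    may overflow; this sketch is meant for moderate a only.
    """
    q = 1.0 - p
    if p == q:
        return 1.0 - z / a
    ratio = q / p
    return (ratio ** a - ratio ** z) / (ratio ** a - 1.0)


def win_probability(z, a, p):
    """Probability p_z of winning: the adversary's ruin (swap p, q and z, a - z)."""
    return ruin_probability(a - z, a, 1.0 - p)


a, p = 10, 0.45
q = 1.0 - p
for z in range(1, a):
    lhs = ruin_probability(z, a, p)
    rhs = p * ruin_probability(z + 1, a, p) + q * ruin_probability(z - 1, a, p)
    # lhs and rhs are the two sides of (2.1); the last column is p_z + q_z.
    print(z, lhs, rhs, lhs + win_probability(z, a, p))
```

Each row should show the two sides of (2.1) agreeing up to rounding and the sum $p_z + q_z$ equal to 1 up to rounding.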

We can reformulate our result as follows: Let a gambler with an
initial capital z play against an infinitely rich adversary who is always
willing to play, although the gambler has the privilege of stopping at his
pleasure. The gambler adopts the strategy of playing until he either loses
his capital or increases it to a (with a net gain a − z). Then $q_z$ is the
probability of his losing and $1 - q_z$ the probability of his winning.

Under this system the gambler’s ultimate gain or loss is a random
variable G which assumes the values a − z and −z with probabilities
$1 - q_z$ and $q_z$, respectively. The expected gain is


(2.6)    $E(G) = a(1 - q_z) - z.$


Clearly E(G) = 0 if, and only if, p = q. This means that, with the
system described, a “fair” game remains fair, and no “unfair” game
can be changed into a “fair” one.

From (2.5) we see that in the case p = q a player with initial capital
z = 999 has a probability 0.999 to win a dollar before losing his capital.
With q = 0.6, p = 0.4 the game is unfavorable indeed, but still the
probability (2.4) of winning a dollar before losing the capital is about $\tfrac23$.
In general, a gambler with a relatively large initial capital z has a reasonable
chance to win a small amount a − z before being ruined.³

Let us now investigate the effect of changing stakes. Changing the
unit from a dollar to a half-dollar is equivalent to doubling the initial
capitals. The corresponding probability of ruin $q_z^*$ is obtained from
(2.4) on replacing z by 2z and a by 2a:


(2.7)    $q_z^* = \frac{(q/p)^{2a} - (q/p)^{2z}}{(q/p)^{2a} - 1} = q_z \cdot \frac{(q/p)^a + (q/p)^z}{(q/p)^a + 1}.$


For q > p the last fraction is greater than unity and $q_z^* > q_z$. We
restate this conclusion as follows: If the stakes are doubled while the
initial capitals remain unchanged, the probability of ruin decreases for
the player whose probability of success is $p < \tfrac12$ and increases for the
adversary (for whom the game is advantageous). Suppose, for example,
that Peter owns 90 dollars and Paul 10, and let p = 0.45, the game
being unfavorable to Peter. If at each trial the stake is one dollar,
table 1 shows the probability of Peter’s ruin to be 0.866, approximately.


3 A certain man used to visit Monte Carlo year after year and was always successful
in recovering the costs of his vacations. He firmly believed in a magic
power over chance. Actually his experience is not surprising. Assuming that he
started with ten times the ultimate gain, the chances of success in any year are
nearly $\tfrac{10}{11}$. The probability of an unbroken sequence of ten successes is about
$(1 - \tfrac{1}{11})^{10} \approx e^{-1} \approx 0.37$. Thus continued success is by no means improbable.
Moreover, one failure would, of course, be blamed on an oversight or momentary
indisposition.


TABLE 1


ILLUSTRATING THE CLASSICAL RUIN PROBLEM


                                  Probability of           Expected
   p      q        z        a     Ruin    Success      Gain      Duration
  0.5    0.5       9       10     0.1     0.9           0               9
  0.5    0.5      90      100     0.1     0.9           0             900
  0.5    0.5     900    1,000     0.1     0.9           0          90,000
  0.5    0.5     950    1,000     0.05    0.95          0          47,500
  0.5    0.5   8,000   10,000     0.2     0.8           0      16,000,000
  0.45   0.55      9       10     0.210   0.790        −1.1            11
  0.45   0.55     90      100     0.866   0.134       −76.6         765.6
  0.45   0.55     99      100     0.182   0.818       −17.2         171.8
  0.4    0.6      90      100     0.983   0.017       −88.3         441.3
  0.4    0.6      99      100     0.333   0.667       −32.3         161.7


The initial capital is z. The game terminates with ruin (loss z) or capital a
(gain a − z).


If the same game is played for a stake of 10 dollars, the probability of 
Peter’s ruin drops to less than one fourth, namely about 0.210. Thus 
the effect of increasing stakes is more pronounced than might be ex- 
pected. In general, if k dollars are staked at each trial, we find the 
probability of ruin from (2.4), replacing z by z/k and a by a/k; the 
probability of ruin decreases as k increases. In a game with constant 
stakes the gambler therefore minimizes the probability of ruin by 
selecting the stake as large as consistent with his goal of gaining an 
amount fixed in advance. The empirical validity of this conclusion 
has been challenged, usually by people who contended that every “unfair”
bet is unreasonable. If this were to be taken seriously, it would
mean the end of all insurance business, for the careful driver who insures
against liability obviously plays a game that is technically “unfair.”
Actually, there exists no theorem in probability to discourage
such a driver from taking insurance.
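The effect of the stake on Peter’s chances can be tabulated directly from (2.4). The sketch below is an illustration only; the function name and the list of stakes (those dividing both 90 and 100) are choices made here, and the results are meant for comparison with the values 0.866 and 0.210 quoted in the text.

```python
def ruin_probability(z, a, p):
    """Probability of ruin from (2.4), or from (2.5) when p = q = 1/2."""
    q = 1.0 - p
    if p == q:
        return 1.0 - z / a
    ratio = q / p
    return (ratio ** a - ratio ** z) / (ratio ** a - 1.0)


peter, paul, p = 90, 10, 0.45
total = peter + paul

ruin_by_stake = {}
for stake in (1, 2, 5, 10):        # stakes dividing both 90 and 100
    # Staking k dollars per trial replaces z by z/k and a by a/k.
    ruin_by_stake[stake] = ruin_probability(peter // stake, total // stake, p)
    print(stake, round(ruin_by_stake[stake], 3))
```

The printed probabilities decrease as the stake grows, in line with the conclusion drawn above for the player with $p < \tfrac12$.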


3. EXPECTED DURATION OF THE GAME 


The probability distribution of the duration of the game will be de- 
duced in the following sections. However, its expected value can be 
derived by a much simpler method which is of such wide applicability 
that it will now be explained at the cost of a slight duplication. 

We are still concerned with the classical ruin problem formulated at
the beginning of this chapter. We shall assume as known the fact that
the duration of the game has a finite expectation $D_z$. A rigorous proof
will be given in the next section.

If the first trial results in success the game continues as if the initial
position had been z + 1. The conditional expectation of the duration
assuming success at the first trial is therefore $D_{z+1} + 1$. This argument
shows that the expected duration $D_z$ satisfies the difference equation


(3.1)    $D_z = p\,D_{z+1} + q\,D_{z-1} + 1, \qquad 0 < z < a,$


with the boundary conditions


(3.2)    $D_0 = 0, \qquad D_a = 0.$


The appearance of the term 1 makes the difference equation (3.1)
non-homogeneous. If $p \ne q$, then $D_z = z/(q - p)$ is a formal solution
of (3.1). The difference $\Delta_z$ of any two solutions of (3.1) satisfies the
homogeneous equation $\Delta_z = p\,\Delta_{z+1} + q\,\Delta_{z-1}$, and we know already
that all solutions of this equation are of the form $A + B(q/p)^z$. It
follows that when $p \ne q$ all solutions of (3.1) are of the form


(3.3)    $D_z = \frac{z}{q - p} + A + B \left(\frac{q}{p}\right)^z.$


The boundary conditions (3.2) require that A + B = 0 and
$A + B(q/p)^a = -a/(q - p)$. Solving for A and B, we find


(3.4)    $D_z = \frac{z}{q - p} - \frac{a}{q - p} \cdot \frac{1 - (q/p)^z}{1 - (q/p)^a}.$


Again the method breaks down if $q = p = \tfrac12$. In this case we must
replace z/(q − p) by $-z^2$, which is now a solution of (3.1). It follows
that when $p = q = \tfrac12$ all solutions of (3.1) are of the form
$D_z = -z^2 + A + Bz$. The required solution $D_z$ satisfying the boundary
conditions (3.2) is


(3.5)    $D_z = z(a - z).$


The expected duration of the game in the classical ruin problem is given
by (3.4) or (3.5), according as $p \ne q$ or $p = q = \tfrac12$.


It should be noted that this duration is considerably longer than we 
would naively expect. If two players with 500 dollars each toss a coin 
until one is ruined, the average duration of the game is 250,000 trials. 
If a gambler has only one dollar and his adversary 1000, the average 
duration is 1000 trials. Further examples are found in table 1. 
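The duration formulas can be evaluated in the same way as the ruin probabilities. The following sketch, with the invented function name `expected_duration`, evaluates (3.4) and (3.5) for a few parameter combinations in the style of table 1 and checks the difference equation (3.1) on a small example; it uses plain floating-point powers and is intended for moderate a only.

```python
def expected_duration(z, a, p):
    """Expected duration D_z: (3.4) when p != q, (3.5) when p = q = 1/2."""
    q = 1.0 - p
    if p == q:
        return float(z * (a - z))
    ratio = q / p
    return z / (q - p) - a / (q - p) * (1.0 - ratio ** z) / (1.0 - ratio ** a)


# A few (p, z, a) combinations in the style of table 1.
for p, z, a in [(0.5, 9, 10), (0.5, 500, 1000), (0.45, 9, 10), (0.45, 90, 100)]:
    print(p, z, a, round(expected_duration(z, a, p), 1))

# Residuals of the difference equation (3.1) for a = 10, p = 0.45.
a, p = 10, 0.45
q = 1.0 - p
residuals = [
    expected_duration(z, a, p)
    - (p * expected_duration(z + 1, a, p) + q * expected_duration(z - 1, a, p) + 1.0)
    for z in range(1, a)
]
print(max(abs(x) for x in residuals))
```

The residuals should vanish up to rounding, and the boundary values $D_0$ and $D_a$ come out as zero.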


Passage to the limit $a \to \infty$. If in the formulas (2.4) and (2.5) for the probability
of ultimate ruin we let $a \to \infty$, we find that


(3.6)    $q_z \to \begin{cases} 1 & \text{if } q \ge p, \\ (q/p)^z & \text{if } q < p. \end{cases}$


Nobody would hesitate to interpret these limits as probabilities of ruin in a game
against an infinitely rich adversary, but axiomatically a random walk on the
semi-infinite interval (0, ∞) should be considered on its own merits. Now saying that in
such a random walk a particle starting at z > 0 reaches the origin is really the
same as saying that in an unrestricted random walk a particle reaches a position
z units to the left from its starting point. This probability has been calculated in
chapter XI, section 3, and agrees with (3.6): In a game against an infinitely rich
adversary the probability of ruin is one if $q \ge p$ and $(q/p)^z$ if q < p. In the second
case there is no sense in talking about the expected duration of the game since
the game may go on forever. When q > p we get for the expected duration of
the game the limit $z(q - p)^{-1}$, and if q = p the limit is infinite. This agrees with
our knowledge that in a symmetric random walk all first-passage times have an
infinite expectation. (An independent derivation of these results is contained in
the next section.)


*4. GENERATING FUNCTIONS FOR THE DURATION OF
THE GAME AND FOR THE FIRST-PASSAGE TIMES


We shall use the method of generating functions to study the duration
of the game in the classical ruin problem, that is, the restricted
random walk with absorbing barriers at 0 and a. The initial position
is z (with 0 < z < a). Let $u_{z,n}$ denote the probability that the process
ends with the nth step at the barrier 0 (gambler’s ruin at the nth
trial). After the first step the position is z + 1 or z − 1, and we conclude
that for 1 < z < a − 1 and $n \ge 1$


(4.1)    $u_{z,n+1} = p\,u_{z+1,n} + q\,u_{z-1,n}.$


This is a difference equation analogous to (2.1), but depending on the
two variables z and n. In analogy with the procedure of section 2 we
wish to define boundary values $u_{0,n}$, $u_{a,n}$, and $u_{z,0}$ so that (4.1) becomes
valid also for z = 1, z = a − 1, and n = 0. For this purpose we put


(4.2)    $u_{0,n} = u_{a,n} = 0$    when $n \ge 1$


and


(4.3)    $u_{0,0} = 1, \qquad u_{z,0} = 0$    when z > 0.


Then (4.1) holds for all z with 0 < z < a and all $n \ge 0$.

We now introduce the generating function


(4.4)    $U_z(s) = \sum_{n=0}^{\infty} u_{z,n}\,s^n.$


Multiplying (4.1) by $s^{n+1}$ and adding for n = 0, 1, 2, ..., we find


(4.5)    $U_z(s) = ps\,U_{z+1}(s) + qs\,U_{z-1}(s), \qquad 0 < z < a,$


and equations (4.2) and (4.3) lead to the boundary conditions


(4.6)    $U_0(s) = 1, \qquad U_a(s) = 0.$


* This section together with the related section 5 may be omitted at first reading.


Equation (4.5) is a difference equation analogous to (2.1), and the
boundary conditions (4.6) correspond to (2.2). The novelty lies in the
circumstance that the coefficients and the unknown $U_z(s)$ now depend
on the variable s, but as far as the difference equation is concerned, s
is merely an arbitrary constant. We can again apply the method of
section 2 provided we succeed in finding two particular solutions of
(4.5). It is natural to inquire whether there exist two solutions $U_z(s)$
of the form $U_z(s) = \lambda^z(s)$. Substituting this expression into (4.5), we
find that $\lambda(s)$ must satisfy the quadratic equation


(4.7)    $\lambda(s) = ps\,\lambda^2(s) + qs,$


which has the two roots


(4.8)    $\lambda_1(s) = \frac{1 + (1 - 4pqs^2)^{1/2}}{2ps}, \qquad \lambda_2(s) = \frac{1 - (1 - 4pqs^2)^{1/2}}{2ps}$


(we take 0 < s < 1 and the positive square root).
We have thus found two particular solutions of (4.5) and conclude
as in section 2 that for arbitrary functions A(s) and B(s)


(4.9)    $U_z(s) = A(s)\,\lambda_1^z(s) + B(s)\,\lambda_2^z(s)$


is a solution of (4.5). To satisfy the boundary conditions (4.6), we
must have A(s) + B(s) = 1 and $A(s)\lambda_1^a(s) + B(s)\lambda_2^a(s) = 0$, or


(4.10)    $U_z(s) = \frac{\lambda_1^a(s)\,\lambda_2^z(s) - \lambda_1^z(s)\,\lambda_2^a(s)}{\lambda_1^a(s) - \lambda_2^a(s)}.$


Using the obvious relation $\lambda_1(s)\lambda_2(s) = q/p$, the last formula simplifies
to


(4.11)    $U_z(s) = \left(\frac{q}{p}\right)^z \frac{\lambda_1^{a-z}(s) - \lambda_2^{a-z}(s)}{\lambda_1^a(s) - \lambda_2^a(s)}.$


This is the required generating function of the probability of ruin at the
nth trial (absorption at 0). The corresponding generating function for
the probability of absorption at a is obtained on replacing p, q, z by
q, p, and a − z, respectively. The generating function of the duration
of the game is, of course, the sum of the two generating functions.


The Case $a = \infty$


Our method applies equally to the case $a = \infty$ which corresponds to
a random walk on (0, ∞) with an absorbing barrier at the origin (or
playing against an infinitely rich adversary). We have now the sole
boundary condition $U_0(s) = 1$. All solutions of (4.5) are of the form
(4.9), but since $\lambda_1(s) > 1$ and $\lambda_2(s) < 1$ for 0 < s < 1, we find that
$U_z(s)$ is unbounded unless A(s) = 0. Hence the required solution is


(4.12)    $U_z(s) = \lambda_2^z(s).$


This is the generating function of the probability that, starting from a
point z > 0, the particle will be absorbed at the origin exactly at the nth
trial.

In other words, in an unrestricted random walk (4.12) is the generating
function of the distribution of first-passage times through a point z units
to the left from the initial position. To get a formula for the first passages
to the right we have only to interchange p and q. A glance at
(4.8) will show that in an unrestricted random walk starting from the
origin the first-passage times through a point z > 0 have the generating
functions


(4.13)    $\lambda^z(s) = \left(\frac{p}{q}\,\lambda_2(s)\right)^z = \lambda_1^{-z}(s).$


For the particular value z = 1 we find $\lambda(s)$ as the generating function
for the first passages one unit to the right. The first passage from
0 through an arbitrary z > 1 is the sum of the first-passage times from
0 to 1, from 1 to 2, ..., from z − 1 to z, and is therefore the sum of z
independent random variables each having the generating function
$\lambda(s)$. This explains why in (4.13) we find the zth power of a generating
function.

Substituting s = 1 into (4.12), we find the probability of ruin in the
case of an infinitely rich adversary. It is $(q/p)^z$ or 1, according as
$q \le p$ or $q > p$.
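The closed form for the generating function can be compared numerically with the power series built from the recursion. The following sketch is only an illustration: the function names and the parameter values a = 6, z = 2, p = 0.4, s = 0.9 are assumptions made here. It computes $u_{z,n}$ from (4.1) with the boundary values (4.2) and (4.3), sums the truncated series (4.4), and compares the result with (4.11); it also checks the relation $\lambda_1 \lambda_2 = q/p$ and the approach to $\lambda_2^z(s)$ of (4.12) when a is large.

```python
from math import sqrt


def ruin_time_probabilities(a, p, n_max):
    """u[n][z]: probability that the walk started at z is absorbed at 0
    exactly at trial n, from recursion (4.1) with boundary values (4.2)-(4.3)."""
    q = 1.0 - p
    u = [[0.0] * (a + 1) for _ in range(n_max + 1)]
    u[0][0] = 1.0                      # u_{0,0} = 1; all other boundary values are 0
    for n in range(n_max):
        for z in range(1, a):
            u[n + 1][z] = p * u[n][z + 1] + q * u[n][z - 1]
    return u


def roots(p, s):
    """The two roots (4.8) of the quadratic (4.7), for 0 < s < 1."""
    q = 1.0 - p
    disc = sqrt(1.0 - 4.0 * p * q * s * s)
    return (1.0 + disc) / (2.0 * p * s), (1.0 - disc) / (2.0 * p * s)


def closed_form(z, a, p, s):
    """Generating function (4.11)."""
    lam1, lam2 = roots(p, s)
    q = 1.0 - p
    return (q / p) ** z * (lam1 ** (a - z) - lam2 ** (a - z)) / (lam1 ** a - lam2 ** a)


a, z, p, s = 6, 2, 0.4, 0.9
u = ruin_time_probabilities(a, p, 400)
series = sum(u[n][z] * s ** n for n in range(401))   # truncated power series (4.4)
print(series, closed_form(z, a, p, s))

lam1, lam2 = roots(p, s)
print(lam1 * lam2, (1.0 - p) / p)                    # product of the roots vs q/p
limit_check = closed_form(z, 60, p, s)               # large a
print(limit_check, lam2 ** z)                        # compare with (4.12)
```

Truncating the series at n = 400 is harmless here because the neglected terms are bounded by powers of s = 0.9.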


*5. EXPLICIT EXPRESSIONS 


We shall now derive an explicit formula for $u_{z,n}$ by expanding $U_z(s)$
into partial fractions. Formally, the expression (4.11) for $U_z(s)$ depends
on a square root, but in reality $U_z(s)$ is a rational function. In
fact, expanding the expressions (4.8) according to the binomial theorem,
we see that the difference $\lambda_1^k(s) - \lambda_2^k(s)$ is a rational function in s
multiplied by $(1 - 4pqs^2)^{1/2}$; this root appears as a factor in both the
numerator and the denominator of (4.11), and hence $U_z(s)$ is the ratio
of two polynomials. The degree of the denominator is a − 1 for a odd
and a − 2 for a even; the degree of the numerator is a − 1 when
a − z is odd and a − 2 when a − z is even. In no case can the degree
of the numerator exceed the degree of the denominator by more than
one. Hence for n > 1 we can compute $u_{z,n}$ from equation XI(4.8),
provided all the roots of the denominator are distinct.

We could calculate the roots of the denominator and the corresponding
coefficients $\rho_\nu$ directly, but the algebra simplifies if we introduce a
new independent variable $\varphi$ by


(5.1)    $\frac{1}{\cos \varphi} = 2(pq)^{1/2} s.$


From (4.8) we find


(5.2)    $\lambda_{1,2}(s) = \left(\frac{q}{p}\right)^{1/2} (\cos \varphi \pm i \sin \varphi) = \left(\frac{q}{p}\right)^{1/2} e^{\pm i\varphi}$


and hence from (4.11)


(5.3)    $U_z(s) = \left(\frac{q}{p}\right)^{z/2} \frac{\sin (a - z)\varphi}{\sin a\varphi}.$


The roots of the denominator are obviously $\varphi = 0, \pi/a, 2\pi/a, \ldots$.
The corresponding values of s are


(5.4)    $s_\nu = \frac{1}{2(pq)^{1/2} \cos (\nu\pi/a)}.$


We get all possible values for $s_\nu$ putting $\nu = 0, 1, \ldots, a$. However,
to $\nu = 0$ and $\nu = a$ there correspond the extraneous values $\varphi = 0, \pi$,
which are also roots of the numerator in (5.3), and if a is even, no
number $s_\nu$ corresponds to $\nu = \tfrac12 a$. Hence, when a is odd, we get all
a − 1 roots $s_\nu$ putting $\nu = 1, 2, \ldots, a - 1$; when a is even, the value
$\nu = \tfrac12 a$ must be omitted. We should disregard those $s_\nu$ which are also
roots of the numerator, but for them (5.6) leads automatically to
$\rho_\nu = 0$.
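We know that


(5.5)    $\left(\frac{q}{p}\right)^{z/2} \frac{\sin (a - z)\varphi}{\sin a\varphi} = As + B + \frac{\rho_1}{s_1 - s} + \cdots + \frac{\rho_{a-1}}{s_{a-1} - s}.$


To find $\rho_\nu$ multiply both sides by $s_\nu - s$ and let $s \to s_\nu$. We get (putting
$\varphi_\nu = \pi\nu/a$) as in formula XI(4.5)


(5.6)    $\rho_\nu = \left(\frac{q}{p}\right)^{z/2} \frac{\sin\,[(a - z)\pi\nu/a]}{-a \cos \nu\pi \cdot (d\varphi/ds)_{s = s_\nu}} = \left(\frac{q}{p}\right)^{z/2} \frac{\sin (z\pi\nu/a) \cdot \sin (\pi\nu/a)}{2a(pq)^{1/2} \cos^2 (\pi\nu/a)}.$


Hence we get finally from (5.5) for the coefficient $u_{z,n}$ of $s^n$ when
n > 1


(5.7)    $u_{z,n} = a^{-1}\,2^n\,p^{(n-z)/2}\,q^{(n+z)/2} \sum_{\nu=1}^{a-1} \cos^{n-1} \frac{\pi\nu}{a} \cdot \sin \frac{\pi\nu}{a} \cdot \sin \frac{\pi z\nu}{a}.$


(Strictly speaking, the term $\nu = \tfrac12 a$ should be omitted when a is even
but it is zero anyway and therefore does no harm.)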

For $n > 1$ formula (5.7) represents the probability of ruin (absorption)
at the nth trial. It goes back to Lagrange and has been derived in many
different ways.⁴ Despite an honorable history and its availability in
textbooks, the formula is rediscovered at frequent intervals. For an
alternative explicit expression see problem 13; for limiting forms see
section 6 and problem 14 (analogous formulas for reflecting barriers
are derived in chapter XVI, section 3).
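
As a numerical sanity check, formula (5.7) can be compared with values obtained by iterating the difference equation of section 4 directly. The following is a minimal sketch in Python and is not part of the original text; the function names are illustrative, and the recursion $u_{z,n+1} = p\,u_{z+1,n} + q\,u_{z-1,n}$ with $u_{0,0} = 1$ and vanishing boundary values for $n \ge 1$ is assumed to be the one referred to as (4.1)-(4.3).

```python
import math

def ruin_at_step_formula(z, n, a, p):
    """Evaluate the trigonometric sum (5.7); intended for n > 1 and 0 < z < a."""
    q = 1.0 - p
    total = sum(
        math.cos(math.pi * v / a) ** (n - 1)
        * math.sin(math.pi * v / a)
        * math.sin(math.pi * z * v / a)
        for v in range(1, a)
    )
    return (2.0 ** n) * p ** ((n - z) / 2) * q ** ((n + z) / 2) * total / a

def ruin_at_step_recursion(z, n, a, p):
    """Iterate the assumed difference equation n times, starting from u_{0,0} = 1."""
    q = 1.0 - p
    u = [0.0] * (a + 1)
    u[0] = 1.0                      # ruin "at step 0" only if the capital is already 0
    for _ in range(n):
        nxt = [0.0] * (a + 1)       # boundary values stay 0 for every later step
        for k in range(1, a):
            nxt[k] = p * u[k + 1] + q * u[k - 1]
        u = nxt
    return u[z]
```

For instance, `ruin_at_step_formula(3, 5, 7, 0.4)` and `ruin_at_step_recursion(3, 5, 7, 0.4)` should agree up to rounding error.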

4 An elementary derivation using trigonometric interpolation was given by Ellis,
Cambridge Mathematical Journal, vol. 4 (1844), or The Mathematical and Other
Writings of R. E. Ellis, Cambridge and London, 1863.

If we let $a \to \infty$, the sum in (5.7) may be interpreted as a Riemann
sum approximating an integral. In this way we find that in a game
against an infinitely rich adversary (single absorbing barrier at 0) the
probability $w_{z,n}$ that a player with initial capital $z > 0$ will be ruined
exactly at the nth step is

(5.8)    $w_{z,n} = 2^n p^{(n-z)/2}q^{(n+z)/2}\int_0^1 \cos^{n-1}\pi x\cdot\sin\pi x\cdot\sin\pi xz\,dx.$

This integral can be expressed in an elementary way⁵ as follows:

(5.9)    $w_{z,n} = \frac{z}{n}\binom{n}{\frac{1}{2}(n-z)}p^{(n-z)/2}q^{(n+z)/2},$

where the binomial coefficient is to be interpreted as zero if $\frac{1}{2}(n-z)$
is not an integer of the interval $[0, n]$. The corresponding generating
function was found to be $\lambda_2^z(s)$ (see end of section 4).
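
The agreement of the integral (5.8) with the closed form (5.9) can be checked numerically. The sketch below is an illustration only (the function names and the choice of a midpoint rule are assumptions, not the author's method); since the integrand is a trigonometric polynomial, a midpoint rule with enough nodes reproduces the integral up to rounding.

```python
import math

def first_passage_closed(z, n, p):
    """Closed form (5.9): (z/n) * C(n, (n-z)/2) * p^((n-z)/2) * q^((n+z)/2)."""
    q = 1.0 - p
    if n < z or (n - z) % 2:
        return 0.0                  # binomial coefficient interpreted as zero
    k = (n - z) // 2
    return z / n * math.comb(n, k) * p ** k * q ** (n - k)

def first_passage_integral(z, n, p, nodes=2000):
    """Integral form (5.8), evaluated by the midpoint rule on [0, 1]."""
    q = 1.0 - p
    h = 1.0 / nodes
    total = 0.0
    for j in range(nodes):
        x = (j + 0.5) * h
        total += (math.cos(math.pi * x) ** (n - 1)
                  * math.sin(math.pi * x) * math.sin(math.pi * x * z))
    return (2.0 ** n) * p ** ((n - z) / 2) * q ** ((n + z) / 2) * total * h
```

Both functions take the initial capital `z`, the step number `n`, and the probability `p` of a unit gain.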


6. PASSAGE TO THE LIMIT; DIFFUSION PROCESSES 


It has already been pointed out that our random-walk models serve
as a first approximation to the theory of diffusion and Brownian motion,
where small particles are exposed to a tremendous number of molecular
shocks. Each shock has a negligible effect, but the superposition of
many small actions produces an observable motion. Accordingly, we
now want to study random walks where the individual steps are extremely
small and occur in very rapid succession. In the limit the
process will appear as a continuous motion. The point of interest is
that in passing to this limit our formulas remain meaningful and agree
with physically significant formulas of diffusion theory which can be
derived under much more general conditions by more streamlined
methods.⁶ This explains partly why the random-walk model, despite
its crudeness, describes diffusion processes reasonably well; only the
limiting case is physically significant, and various discrete models lead
to the same limiting formulas. The situation is in many ways analogous
to the conditions of the central limit theorem where the cumulative
effect of many chance components is practically independent of the
nature of the individual components.

5 For $p = q = \frac{1}{2}$ formula (5.9) reduces to the formula III(4.11) for the
first-passage time distribution. It is by no means easy to verify that (5.8) and (5.9)
agree. Perhaps the simplest way is to show that both formulas represent solutions
of the difference equation (4.1) with the boundary conditions (4.2)-(4.3) at the
origin.

6 The limiting formulas of the present section agree with those of the now classical
Einstein-Wiener theory. The newer, more refined theories (Uhlenbeck, Ornstein)
are not considered here. Credit for discovering the connection between random
walks and diffusion is due principally to L. Bachelier (1870- ). His work is
frequently of a heuristic nature, but he derived many new results. Kolmogorov's
theory of stochastic processes of the Markov type is based largely on Bachelier's
ideas. See in particular L. Bachelier, Calcul des probabilités, Paris, 1912.

Let us begin with an unrestricted random walk starting at the origin,
and let $v_{x,n}$ be the probability that the nth step takes the particle to the
position $x$. If $r$ among the $n$ steps are directed to the right, $n - r$ are
directed to the left, and the total displacement is $r - (n - r) = 2r - n$
units. This displacement can equal $x$ only if $n$ and $x$ are either both
even or both odd (which means that after an even number of steps the
abscissa $x$ is an even integer). Out of $n$ steps $r$ can be selected in $\binom{n}{r}$
ways, and therefore

(6.1)    $v_{x,n} = \binom{n}{\frac{1}{2}(n+x)}p^{(n+x)/2}q^{(n-x)/2};$

again the binomial coefficient should be interpreted as 0 whenever
$\frac{1}{2}(n+x)$ is not an integer in the interval $[0, n]$.

An alternative way of deriving (6.1) uses the argument which led to
the difference equation (4.1) and the boundary conditions (4.2) and
(4.3). It can be verified that $v_{x,n}$ must satisfy the difference equation

(6.2)    $v_{x,n+1} = p\,v_{x-1,n} + q\,v_{x+1,n}$

with the boundary conditions

(6.3)    $v_{0,0} = 1, \qquad v_{x,0} = 0$ for $x \ne 0$.

Given (6.3), we put in (6.2) successively $n = 0, 1, 2, \ldots$ and get first all
values $v_{x,1}$, and then successively $v_{x,2}, v_{x,3}, \ldots$. This shows that the
conditions (6.2) and (6.3) uniquely determine $v_{x,n}$. On the other hand,
it is readily seen that (6.1) is a solution.
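
The uniqueness argument is constructive, and the successive computation of $v_{x,1}, v_{x,2}, \ldots$ can be sketched in a few lines of Python and compared with (6.1). This is an illustration only; the function names are hypothetical.

```python
import math

def v_closed(x, n, p):
    """Formula (6.1); zero when (n + x)/2 is not an integer in [0, n]."""
    if abs(x) > n or (n + x) % 2:
        return 0.0
    r = (n + x) // 2                      # number of steps to the right
    return math.comb(n, r) * p ** r * (1.0 - p) ** (n - r)

def v_rows(n_max, p):
    """Rows n = 0..n_max of v_{x,n}, built from (6.3) by repeated use of (6.2)."""
    q = 1.0 - p
    row = {0: 1.0}                        # boundary condition (6.3)
    rows = [row]
    for _ in range(n_max):
        nxt = {}
        for x, prob in row.items():
            # a particle at x moves to x+1 with probability p, to x-1 with q;
            # collecting terms at each target point reproduces (6.2)
            nxt[x + 1] = nxt.get(x + 1, 0.0) + p * prob
            nxt[x - 1] = nxt.get(x - 1, 0.0) + q * prob
        row = nxt
        rows.append(row)
    return rows
```

Each entry `v_rows(n_max, p)[n]` is a dictionary mapping positions `x` to $v_{x,n}$; positions that cannot be reached are simply absent.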

Let us now change the unit of length so that each step has length $\Delta x$
and suppose that the time between any two consecutive steps is $\Delta t$. During
time $t$ the particle performs about $t/\Delta t$ jumps, and a displacement $x$ is
now equivalent to $x/\Delta x$ units. Only multiples of $\Delta x$ and $\Delta t$ represent
meaningful coordinates, but in the limit $\Delta x \to 0$, $\Delta t \to 0$ every
displacement and all times become possible.

We must not expect sensible results if $\Delta x$ and $\Delta t$ approach zero in an
arbitrary manner, for the maximum possible displacement in time $t$
amounts to $t\,\Delta x/\Delta t$, so that in the limit no motion exists if $\Delta x/\Delta t \to 0$.
Physically speaking, we must keep the $x$- and $t$-scales in an appropriate
ratio or the process will degenerate in the limit, the variances tending
to zero or infinity. To find the proper ratio note that the total
displacement during time $t$ is the sum of about $t/\Delta t$ mutually independent
random variables each having the mean $(p - q)\Delta x$ and variance
$\{1 - (p - q)^2\}(\Delta x)^2 = 4pq(\Delta x)^2$. The mean and variance of the total
displacement in time $t$ are therefore about $t(p - q)\Delta x/\Delta t$ and
$4pqt(\Delta x)^2/\Delta t$, respectively. To obtain reasonable results we must let
$\Delta x$ and $\Delta t$ approach zero in such a way that the mean and variance
remain finite for all $t$. The finiteness of the variance requires that
$(\Delta x)^2/\Delta t$ should remain bounded; the finiteness of the mean implies
that $p - q$ must be of the order of magnitude of $\Delta x$. This suggests putting

(6.4)    $\frac{(\Delta x)^2}{\Delta t} = 2D, \qquad p = \frac{1}{2} + \frac{c}{2D}\,\Delta x, \qquad q = \frac{1}{2} - \frac{c}{2D}\,\Delta x,$

where $D$ and $c$ are constants. The value of $D$ introduces only a scale
factor; for mathematical simplicity it is best to put $D = 1$, but we keep
$D$ unspecified to facilitate comparison with physical theories. The
constants $D$ and $c$ are, respectively, the diffusion coefficient and the
drift. If $c = 0$, the random walk is symmetric; in general, the sign of
$c$ determines the direction of the drift. In the limit $p$ and $q$ approach
$\frac{1}{2}$; with any other norming the particle would drift away so fast that
the probability of finite displacements would tend to zero.

We use the norming (6.4) to pass to the limit $\Delta x \to 0$, $\Delta t \to 0$.
The total displacement at time $t \approx n\Delta t$ is determined by $n$ Bernoulli
trials, and therefore the limiting form of $v_{x,n}$ is given by the normal
distribution. For a fixed $\Delta x$ the displacement is the sum of finitely
many independent variables, and its mean is $t(p - q)\Delta x/\Delta t = 2ct$; its
variance is $4pqt(\Delta x)^2/\Delta t \approx 2Dt$. Therefore the probability that at time $t$
the displacement lies between $x_0$ and $x_1$ ($x_0 < x_1$) tends to

(6.5)    $(2\pi)^{-1/2}\int_{y_0}^{y_1} e^{-y^2/2}\,dy$

where $y_1 = (x_1 - 2ct)/(2Dt)^{1/2}$ and $y_0 = (x_0 - 2ct)/(2Dt)^{1/2}$.
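
To see the norming (6.4) at work, one can compare the exact binomial probability for a small but fixed $\Delta x$ with the normal integral (6.5). The sketch below is illustrative; the function names and the parameter values in the usage note are assumptions chosen for the example.

```python
import math

def binomial_interval_prob(x0, x1, t, D, c, dx):
    """Exact P{x0 < displacement < x1} for the walk normed according to (6.4)."""
    dt = dx * dx / (2.0 * D)
    n = round(t / dt)                       # number of steps performed up to time t
    p = 0.5 + c * dx / (2.0 * D)
    q = 1.0 - p
    total = 0.0
    for r in range(n + 1):                  # r steps to the right
        x = (2 * r - n) * dx
        if x0 < x < x1:
            log_pmf = (math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
                       + r * math.log(p) + (n - r) * math.log(q))
            total += math.exp(log_pmf)
    return total

def normal_interval_prob(x0, x1, t, D, c):
    """The limit (6.5), written with the error function."""
    s = math.sqrt(2.0 * D * t)
    phi = lambda y: 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))
    return phi((x1 - 2.0 * c * t) / s) - phi((x0 - 2.0 * c * t) / s)
```

With, say, `D = 1`, `c = 0.5`, `t = 1` and `dx = 0.05` the walk performs 800 steps, and the two probabilities for the interval `(-0.25, 2.25)` should already be close; the end points are deliberately placed halfway between lattice points.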

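As for equation (6.2), we pass to the usual functional notation and
write it in the form $v(x, t+\Delta t) = p\cdot v(x-\Delta x, t) + q\cdot v(x+\Delta x, t)$.
Expanding according to Taylor's theorem up to terms of second order,
we get formally

(6.6)    $\Delta t\cdot\frac{\partial v(x,t)}{\partial t} = (q - p)\,\Delta x\cdot\frac{\partial v(x,t)}{\partial x} + \frac{(\Delta x)^2}{2}\,\frac{\partial^2 v(x,t)}{\partial x^2}.$

Using (6.4), we get in the limit

(6.7)    $\frac{\partial v(x,t)}{\partial t} = -2c\,\frac{\partial v(x,t)}{\partial x} + D\,\frac{\partial^2 v(x,t)}{\partial x^2}.$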
This is the Fokker-Planck equation for diffusion with drift, which can
be derived from more general and more convincing assumptions. In
the usual theory, the solution (6.5) is derived from (6.7), but we have
obtained both results by the same limiting process. Our procedure is
only heuristic but can be justified rigorously. All formulas of the
discrete random walk permit a similar passage to the limit.
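
That the normal density underlying (6.5) satisfies the equation $\partial v/\partial t = -2c\,\partial v/\partial x + D\,\partial^2 v/\partial x^2$ can be checked numerically with central differences. The following sketch is an illustration under that reading of (6.7); the function names and the step size `h` are assumptions.

```python
import math

def density(x, t, D, c):
    """Normal density with mean 2ct and variance 2Dt (the integrand of (6.5) in x)."""
    return math.exp(-(x - 2.0 * c * t) ** 2 / (4.0 * D * t)) / math.sqrt(4.0 * math.pi * D * t)

def fokker_planck_residual(x, t, D, c, h=1e-3):
    """Finite-difference estimate of v_t - (-2c v_x + D v_xx); should be near zero."""
    v = lambda xx, tt: density(xx, tt, D, c)
    v_t = (v(x, t + h) - v(x, t - h)) / (2.0 * h)
    v_x = (v(x + h, t) - v(x - h, t)) / (2.0 * h)
    v_xx = (v(x + h, t) - 2.0 * v(x, t) + v(x - h, t)) / (h * h)
    return v_t - (-2.0 * c * v_x + D * v_xx)
```

The residual is not exactly zero because of the discretization error of order `h**2`, but it should be several orders of magnitude smaller than the density itself.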

As a further example, consider the limiting form of the probabilities
for the first passage. For simplicity let us first consider formula (5.9)
which corresponds to a single barrier. Of the two quantities $w_{z,n}$
and $w_{z,n+1}$, one is necessarily zero. The sum $w_{z,n} + w_{z,n+1}$ represents,
asymptotically, the probability of absorption during the time interval
$(t, t + 2\Delta t)$. We shall show that $w_{z,n} + w_{z,n+1} \sim f(z, t)(2\Delta t)$, where
$f(z, t)$ is a continuous function. Then the limiting probability of
absorption within any time interval $(t_1, t_2)$ is the integral of $f(z, t)$
extended over that interval. When $n - z$ is even, we have $w_{z,n+1} = 0$,
and to find $f(z, t)$ we must replace $z$ in (5.9) by $z/\Delta x$ and $n$ by $t/\Delta t$,
and apply (6.4). Using the normal approximation to the binomial
distribution and the last equation (6.9), we find easily⁷

(6.8)    $f(z, t) = \frac{z}{2(\pi D)^{1/2}\,t^{3/2}}\;e^{-(z+2ct)^2/(4Dt)}.$

This is the limiting form of (5.9); again it coincides with the
corresponding formula of diffusion theory. In fact, it is easily verified that
$f(-z, t)$ is a solution of (6.7). (In the definition of $w_{z,n}$ the variable
$z$ plays the role of $-x$ in $v_{x,n}$.)
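
The passage from (5.9) to its continuous limit can be illustrated numerically: for a small $\Delta x$ one evaluates $w_{z,n}/(2\Delta t)$ with $z$ and $n$ replaced by $z/\Delta x$ and $t/\Delta t$, and compares it with the density $z\,(2(\pi D)^{1/2}t^{3/2})^{-1}\exp\{-(z+2ct)^2/(4Dt)\}$. The sketch below is an illustration; the function names and the parameter values in the usage note are assumptions.

```python
import math

def scaled_first_passage(z, t, D, c, dx):
    """w_{z,n} / (2 dt) from (5.9), with the norming (6.4) for a fixed small dx."""
    dt = dx * dx / (2.0 * D)
    zs = round(z / dx)                     # initial position in lattice units
    n = round(t / dt)                      # number of steps in lattice units
    if (n - zs) % 2:
        n += 1                             # w_{z,n} vanishes when n - z is odd
    p = 0.5 + c * dx / (2.0 * D)
    q = 1.0 - p
    k = (n - zs) // 2                      # number of steps to the right
    log_w = (math.log(zs / n) + math.lgamma(n + 1) - math.lgamma(k + 1)
             - math.lgamma(n - k + 1) + k * math.log(p) + (n - k) * math.log(q))
    return math.exp(log_w) / (2.0 * dt)

def limit_density(z, t, D, c):
    """The limiting first-passage density for a single absorbing barrier."""
    return (z / (2.0 * math.sqrt(math.pi * D) * t ** 1.5)
            * math.exp(-(z + 2.0 * c * t) ** 2 / (4.0 * D * t)))
```

For example, with `D = 0.5`, `c = 0.2`, `z = 1`, `t = 1` and `dx = 0.02` the lattice walk takes 2500 steps from position 50, and the two values should differ only by a small relative error.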

7 In the symmetric case $c = 0$ (i.e., $p = \frac{1}{2}$), formula (6.8) agrees with the limiting
distribution for first-passage times derived by elementary methods in III(8.c).

A similar argument applies to (5.7). An inspection of this formula
shows that the contributions of $\nu = k$ and $\nu = a - k$ cancel if $n - z$
is odd and add if $n - z$ is even. Hence we get the limiting form of
$f(z, t) \sim (u_{z,n} + u_{z,n+1})/(2\Delta t)$ by extending in (5.7) the sum twice
over $1 \le \nu < a/2$. Replacing $z$, $a$, $n$ respectively by $z/\Delta x$, $a/\Delta x$, $t/\Delta t$
and observing that for fixed $\nu$

(6.9)    $\sin\frac{\pi\nu\Delta x}{a} \sim \frac{\pi\nu\Delta x}{a},$

         $\left(\cos\frac{\pi\nu\Delta x}{a}\right)^{t/\Delta t} \sim \left(1 - \frac{D\pi^2\nu^2\Delta t}{a^2}\right)^{t/\Delta t} \sim e^{-D\pi^2\nu^2 t/a^2},$

         $(4pq)^{t/2\Delta t}\,(q/p)^{z/2\Delta x} \sim e^{-(ct+z)c/D},$

we obtain formally the limiting form

(6.10)    $f(z, t) = 2\pi Da^{-2}\,e^{-(ct+z)c/D}\sum_{\nu=1}^{\infty}\nu\,e^{-\nu^2\pi^2Dt/a^2}\sin\frac{\pi z\nu}{a}.$

The formal passage to the limit is justified because of uniform
convergence: the contribution of the terms with large $\nu$ is negligible both
in (6.10) and in the original sum (5.7) (where we have $\nu < a/2$).
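
The same kind of numerical comparison can be made for two barriers: the sum (5.7), evaluated on a fine lattice and divided by $2\Delta t$, should be close to the series $2\pi Da^{-2}e^{-(ct+z)c/D}\sum\nu e^{-\nu^2\pi^2Dt/a^2}\sin(\pi z\nu/a)$. The sketch below is an illustration only; function names, the truncation of the series at 200 terms, and the parameter values in the usage note are assumptions.

```python
import math

def scaled_two_barrier(z, a, t, D, c, dx):
    """u_{z,n} / (2 dt) from (5.7), with z, a, n measured in lattice units."""
    dt = dx * dx / (2.0 * D)
    zs, a_steps, n = round(z / dx), round(a / dx), round(t / dt)
    if (n - zs) % 2:
        n += 1                             # ruin at step n requires n - z even
    p = 0.5 + c * dx / (2.0 * D)
    q = 1.0 - p
    # 2^n p^((n-z)/2) q^((n+z)/2) = (4pq)^(n/2) (q/p)^(z/2), computed via logarithms
    log_pref = 0.5 * n * math.log(4.0 * p * q) + 0.5 * zs * math.log(q / p)
    total = sum(
        math.cos(math.pi * v / a_steps) ** (n - 1)
        * math.sin(math.pi * v / a_steps)
        * math.sin(math.pi * zs * v / a_steps)
        for v in range(1, a_steps)
    )
    return math.exp(log_pref) * total / a_steps / (2.0 * dt)

def limiting_series(z, a, t, D, c, terms=200):
    """Truncated version of the limiting series for first passage at the origin."""
    s = sum(v * math.exp(-v * v * math.pi ** 2 * D * t / a ** 2)
            * math.sin(math.pi * z * v / a) for v in range(1, terms + 1))
    return 2.0 * math.pi * D / a ** 2 * math.exp(-(c * t + z) * c / D) * s
```

With `D = 0.5`, `c = 0.1`, `a = 2`, `z = 0.6`, `t = 0.5` and `dx = 0.02`, the lattice has 100 sites and the walk takes 1250 steps; the two values should agree to within a small relative error.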

In diffusion theory (6.10) is known as Fürth's formula for first
passages and is derived directly from the Fokker-Planck equation. In
free diffusion the integral over (6.10), extended over the time interval
$(t_1, t_2)$, gives the probability that a particle starting at $z > 0$ will within
that time interval for the first time reach the origin without having
previously passed the barrier $a$.


*7. RANDOM WALKS IN THE PLANE AND SPACE 


In a two-dimensional random walk the particle moves in unit steps
in one of the four directions parallel to the $x$- and $y$-axes. For a
particle starting at the origin the possible positions are all points of the
plane with integral-valued coordinates. Each position has four
neighbors. Similarly, in three dimensions each position has six neighbors. In
order to define the random walk the corresponding four or six
probabilities must be specified. For simplicity we shall consider only the
symmetric case where all directions have the same probability. The
complexity of problems is considerably greater than in one dimension,
for now the domains to which the particle is restricted may have
arbitrary shapes so that complicated boundaries take the place of the
single-point barriers in the one-dimensional case.

We begin with an interesting theorem due to Polya.⁸

Theorem. In the symmetric random walks in one and two dimensions
there is probability one that the particle will sooner or later (and therefore
infinitely often) return to its initial position. In three dimensions,
however, this probability is only about 0.35 (the expected number of returns
is then $0.65\sum k(0.35)^k = 0.35/0.65 \approx 0.53$).

Before proving the theorem let us give two alternative formulations,
both due to Polya. First, it is almost obvious that the theorem implies
that in one and two dimensions there is probability 1 that the particle will
pass infinitely often through every possible point; in three dimensions this
is not true, however. Thus the statement "all roads lead to Rome" is,
in a way, justified in two dimensions.

* This section treats a special topic and may be omitted at first reading.

8 G. Polya, Über eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend die
Irrfahrt im Strassennetz, Mathematische Annalen, vol. 84 (1921), pp. 149-160. The
numerical value 0.35 was calculated by W. H. McCrea and F. J. W. Whipple,
Random paths in two and three dimensions, Proceedings of the Royal Society of
Edinburgh, vol. 60 (1940), pp. 281-298.

Alternatively, consider two particles performing independent
symmetric random walks, the steps occurring simultaneously. Will they
ever meet? To simplify language let us define the distance of two
possible positions as the smallest number of steps leading from one
position to the other. (Then distance = sum of absolute differences of
the coordinates.) If the two particles move one step each, their mutual
distance either remains the same or changes by two units, and so their
distance either is even at all times or else is always odd. In the second
case the two particles can never occupy the same position. In the
first case it is readily seen that the probability of their meeting at the
nth step equals the probability of the first particle's reaching in $2n$
steps the initial position of the second particle. Hence our theorem
states that in two, but not in three, dimensions the two particles are
sure infinitely often to occupy the same position. If the initial
distance of the two particles is odd, a similar argument shows that they
will infinitely often occupy neighboring positions. If this is called
meeting, then our theorem asserts that in one and two dimensions the
two particles are certain to meet infinitely often, but in three dimensions
there is a positive probability that they never meet.


Proof. For one dimension the theorem has been proved in example
XIII(3.b), except that there we referred to a coin-tossing game rather
than to a symmetric random walk. The proof for two and three
dimensions proceeds along the same lines. Let $u_n$ be the probability that
the nth trial takes the particle to the initial position. According to
theorem 2 of chapter XIII, section 3, we have to prove that in the case
of two dimensions $\sum u_n$ diverges, whereas in the case of three dimensions
$\sum u_n \approx 0.53$. In two dimensions a return to the initial position is
possible only if the numbers of steps in the positive $x$- and $y$-directions
equal those in the negative $x$- and $y$-directions, respectively. Hence
$u_n = 0$ if $n$ is odd and (using the multinomial distribution VI(9.2))

(7.1)    $u_{2n} = \frac{1}{4^{2n}}\sum_{k=0}^{n}\frac{(2n)!}{k!\,k!\,(n-k)!\,(n-k)!} = \frac{1}{4^{2n}}\binom{2n}{n}\sum_{k=0}^{n}\binom{n}{k}^2.$

The last expression equals $4^{-2n}\binom{2n}{n}^2$, by formula II(12.11).
Stirling's formula shows that $u_{2n}$ is of the order of magnitude $1/n$, so that
$\sum u_{2n}$ diverges as asserted.
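
The identity behind (7.1) and the order of magnitude $1/n$ are easy to check with exact integer arithmetic. The sketch below is illustrative; the function name is hypothetical, and the constant $1/\pi$ used in the usage note is the value that Stirling's formula gives for the limit of $n\,u_{2n}$.

```python
from fractions import Fraction
from math import comb, factorial, pi

def u2n_plane(n):
    """Return probability at step 2n in the plane, from the multinomial sum in (7.1)."""
    total = sum(
        factorial(2 * n) // (factorial(k) ** 2 * factorial(n - k) ** 2)
        for k in range(n + 1)
    )
    return Fraction(total, 4 ** (2 * n))

def partial_sum(n_max):
    """Partial sum u_2 + u_4 + ... + u_{2 n_max}, which grows without bound."""
    return float(sum(u2n_plane(n) for n in range(1, n_max + 1)))
```

For each `n` the value `u2n_plane(n)` coincides with `Fraction(comb(2*n, n)**2, 4**(2*n))`, and `n * u2n_plane(n)` approaches `1/pi`, so the partial sums grow like a multiple of `log n`.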
In the case of three dimensions we find similarly

(7.2)    $u_{2n} = \frac{1}{6^{2n}}\sum_{j,k}\frac{(2n)!}{j!\,j!\,k!\,k!\,(n-j-k)!\,(n-j-k)!},$

the summation extending over all $j, k$ with $j + k \le n$. It is easily
verified that

(7.3)    $u_{2n} = \frac{1}{2^{2n}}\binom{2n}{n}\sum_{j,k}\left\{\frac{1}{3^n}\,\frac{n!}{j!\,k!\,(n-j-k)!}\right\}^2.$

Within the braces we have the terms of a trinomial distribution, and
we know that they add to unity. Hence the sum of the squares is
smaller than the maximum term within braces, and the latter is attained
when both $j$ and $k$ are close to $n/3$. Stirling's formula shows that this
maximum is of the order of magnitude $n^{-1}$, and therefore $u_{2n}$ is of the
magnitude $n^{-3/2}$ so that $\sum u_{2n}$ converges as asserted.
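
Formula (7.3) lends itself to direct computation. The sketch below evaluates $u_{2n}$ exactly and forms partial sums of $\sum u_{2n}$; it is an illustration only, with hypothetical function names. Because the terms decrease only like $n^{-3/2}$, the partial sums approach their limit slowly, so a modest number of terms gives a noticeably smaller value than the limit quoted in the text.

```python
from fractions import Fraction
from math import comb, factorial

def u2n_space(n):
    """Return probability at step 2n in three dimensions, from (7.3)."""
    total = Fraction(0)
    for j in range(n + 1):
        for k in range(n - j + 1):
            # one term of the trinomial distribution with probabilities 1/3 each
            term = Fraction(factorial(n),
                            factorial(j) * factorial(k) * factorial(n - j - k) * 3 ** n)
            total += term * term
    return Fraction(comb(2 * n, n), 4 ** n) * total

def partial_sum_space(n_max):
    """Partial sum u_2 + u_4 + ... + u_{2 n_max} as a floating-point number."""
    return float(sum(u2n_space(n) for n in range(1, n_max + 1)))
```

As a check on small cases, `u2n_space(1)` is `1/6` (six ways to step out and back among 36 two-step paths) and `u2n_space(2)` is `5/72`.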

Polya's theorem is analogous to the facts concerning multiple coin
tossings discussed in example XIII(3.c).

We conclude this section with another problem which generalizes
the concept of absorbing barriers. Consider the case of two dimensions
where instead of the interval $0 \le x \le a$ we have a plane domain $D$,
that is, a collection of points with integral-valued coordinates. Each
point has four neighbors, but for some points of $D$ one or more of the
neighbors lie outside $D$. Such points form the boundary of $D$, and all
other points are called interior points. In the one-dimensional case
the two barriers form the boundary, and our problem consisted in
finding the probability that, starting from $z$, the particle will reach the
boundary point 0 before reaching $a$. By analogy, we now ask for the
probability that the particle will reach a certain section of the boundary
before reaching any boundary point that is not in this section. This
means that we divide all boundary points into two sets $B'$ and $B''$. If
$(x, y)$ is an interior point, we ask for the probability $u(x, y)$ that,
starting from $(x, y)$, the particle will reach a point of $B'$ before reaching
a point of $B''$. In particular, if $B'$ consists of a single point, then
$u(x, y)$ is the probability that the particle will, sooner or later, be
absorbed at that particular point.

Let $(x, y)$ be an interior point. The first step takes the particle
from $(x, y)$ to one of the four neighbors $(x \pm 1, y)$, $(x, y \pm 1)$, and if all
four of them are interior points, we must have

(7.4)    $u(x, y) = \frac{1}{4}\{u(x+1, y) + u(x-1, y) + u(x, y+1) + u(x, y-1)\}.$

This is a partial difference equation which takes the place of (2.1) (with
$p = q = \frac{1}{2}$). If $(x+1, y)$ is a boundary point, then its contribution
$u(x+1, y)$ must be replaced by 1 or 0, according to whether $(x+1, y)$
belongs to $B'$ or $B''$. Hence (7.4) will be valid for all interior points if
we agree that for a boundary point $(\xi, \eta)$ we put $u(\xi, \eta) = 1$ if $(\xi, \eta)$ is in
$B'$ and $u(\xi, \eta) = 0$ if $(\xi, \eta)$ is in $B''$. This convention takes the place of
the boundary conditions (2.2).

In (7.4) we have a system of linear equations for the unknowns
$u(x, y)$; to each interior point there correspond one unknown and one
equation. The system is non-homogeneous, since in it there appears at
least one boundary point $(\xi, \eta)$ of $B'$ and it gives rise to a contribution $\frac{1}{4}$
on the right side. If the domain $D$ is finite, there are as many equations
as unknowns, and it is well known that the system has a unique
solution if, and only if, the corresponding homogeneous system (with
$u(\xi, \eta) = 0$ for all boundary points) has no non-vanishing solution.
Now $u(x, y)$ is the mean of the four neighboring values $u(x \pm 1, y)$,
$u(x, y \pm 1)$ and cannot exceed all four. In other words, $u(x, y)$ has
neither a maximum nor a minimum in the strict sense, and the greatest
and the smallest value occur at boundary points. Hence, if all
boundary values vanish, so does $u(x, y)$ at all interior points, which proves
the existence and uniqueness of the solution of (7.4). Since the
boundary values are 0 and 1, all values $u(x, y)$ lie between 0 and 1, as is
required for probabilities. These statements are true also for the case
of infinite domains, as will be seen from a general theorem on infinite
Markov chains.
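
For a small finite domain the system (7.4) can be solved by simple iteration. The sketch below is an illustration under stated assumptions: the domain is taken to be the square of points $(x, y)$ with $0 \le x, y \le N$, the function name and the predicate `in_b_prime` are hypothetical, and repeated averaging (a Gauss-Seidel sweep) is one convenient solution method, not one used in the text.

```python
def absorption_probabilities(N, in_b_prime, sweeps=500):
    """Approximate u(x, y) of (7.4) on the square 0 <= x, y <= N by repeated averaging.

    in_b_prime(x, y) tells whether a boundary point belongs to B'; all other
    boundary points are treated as points of B''.
    """
    u = {}
    for x in range(N + 1):
        for y in range(N + 1):
            on_boundary = x in (0, N) or y in (0, N)
            if on_boundary:
                u[(x, y)] = 1.0 if in_b_prime(x, y) else 0.0   # boundary convention
            else:
                u[(x, y)] = 0.5                                # arbitrary starting guess
    interior = [(x, y) for x in range(1, N) for y in range(1, N)]
    for _ in range(sweeps):
        for (x, y) in interior:
            # each interior value is replaced by the mean of its four neighbors
            u[(x, y)] = 0.25 * (u[(x + 1, y)] + u[(x - 1, y)]
                                + u[(x, y + 1)] + u[(x, y - 1)])
    return u
```

As a usage example, take `N = 6` and let $B'$ be the left side of the square, `in_b_prime = lambda x, y: x == 0`. By symmetry the particle starting at the center `(3, 3)` is equally likely to leave through any of the four sides, so the computed value there should be close to `0.25`.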


8. THE GENERALIZED ONE-DIMENSIONAL RANDOM WALK 
(SEQUENTIAL SAMPLING) 


We now return to one dimension but abandon the restriction that
the particle moves in unit steps. Instead, at each step the particle shall
have probability $p_k$ to move from any point $x$ to $x + k$, where the integer
$k$ may be zero, positive, or negative. We shall investigate the following
ruin problem: The particle starts from a position $z$ such that $0 < z < a$;
we seek the probability $u_z$ that the particle will arrive at some position
$\le 0$ before reaching any position $\ge a$. In other words, the position
of the particle at time $n$ is the point $z + X_1 + X_2 + \cdots + X_n$ of the
$x$-axis, where the $\{X_k\}$ are mutually independent random variables
with the common distribution $\{p_\nu\}$; the process stops when for the first
time either $X_1 + \cdots + X_n \le -z$ or $X_1 + \cdots + X_n \ge a - z$.


9 Explicit solutions are known in only a few cases and are always very compli- 
cated. Solutions for the case of rectangular domains, infinite strips, etc., will be 
found in the paper by McCrea and Whipple cited in footnote 8. 
This problem has attracted widespread interest in connection with
sequential sampling. There the $X_k$ represent certain characteristics of
samples or observations. Measurements are taken until a sum
$X_1 + \cdots + X_k$ falls outside two preassigned limits (our $-z$ and $a - z$).
In the first case the procedure leads to what is technically known as
rejection, in the second case to acceptance. The first sampling
procedure of this kind was described by W. Bartky;¹⁰ the general theory
was outlined by A. Wald, to whom the formulation above is due.¹¹

Without loss of generality we shall suppose that steps are possible
in both the positive and negative directions. Otherwise we would
have either $u_z = 0$ or $u_z = 1$ for all $z$.

The probability of ruin at the first step is obviously

(8.1)    $r_z = p_{-z} + p_{-z-1} + p_{-z-2} + \cdots$

(a quantity which may be zero). The random walk continues only if
the particle moved to a position $x$ with $0 < x < a$; the probability of
a jump from $z$ to $x$ is $p_{x-z}$, and the probability of subsequent ruin is
then $u_x$. Therefore

(8.2)    $u_z = \sum_{x=1}^{a-1} u_x p_{x-z} + r_z.$


Once more we have here $a - 1$ linear equations for $a - 1$ unknowns
$u_z$. The system is non-homogeneous, since at least for $z = 1$ the
probability $r_z$ is different from zero (steps in the negative direction
being possible, which obviously implies $r_1 > 0$). We claim that the
corresponding homogeneous system

(8.3)    $u_z = \sum_{x=1}^{a-1} u_x p_{x-z}$

has no solution except 0.

In fact, if it had another solution, one of the values $u_z$ would be
largest in absolute value, say $u_z = M > 0$. Suppose first that $p_{-1} \ne 0$.
Since the coefficients $p_{x-z}$ in (8.3) add to at most unity, the equation
is possible only if all those $u_x$ which actually appear on the right
side (with a coefficient different from zero) equal $M$, and if their
coefficients add to 1. Hence $u_{z-1} = M$, and, arguing the same way,
$u_{z-2} = u_{z-3} = \cdots = u_1 = M$. However, for $z = 1$ the coefficients
$p_{x-z}$ in (8.3) add to less than unity, so that $M$ must be zero. The
same argument obviously applies also if $p_{-1} = 0$, since we can replace
$p_{-1}$ by some positive coefficient $p_k$ with $k < 0$.

10 W. Bartky, Multiple sampling with constant probability, Annals of
Mathematical Statistics, vol. 14 (1943), pp. 363-377. It is described in example XV (2.7).

11 A. Wald, On cumulative sums of random variables, Annals of Mathematical
Statistics, vol. 15 (1944), pp. 288-296. The methods described in the present book
are different from Wald's. See also Wald's book, Sequential analysis, John Wiley
& Sons, New York, 1947.

It follows that (8.2) has a unique solution, and thus our problem is
determined. Equation (8.2) plays the role of the difference equation
(2.1). Again we can simplify the writing by introducing the boundary
conditions

(8.4)    $u_x = 1$ if $x \le 0$,
         $u_x = 0$ if $x \ge a$.

Then (8.2) can be written in the form

(8.5)    $u_z = \sum u_x p_{x-z},$

the summation now extending over all $x$ (for $x \ge a$ we have no
contribution owing to the second condition (8.4); the contributions for
$x \le 0$ add to $r_z$ owing to the first condition).
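
For small $a$ the $a - 1$ linear equations (8.2) can simply be solved directly. The sketch below does so with exact rational arithmetic; it is an illustration, the function name and the representation of the distribution as a dictionary `{k: p_k}` are assumptions, and Gauss-Jordan elimination is merely one convenient choice of method.

```python
from fractions import Fraction

def ruin_probabilities(step_probs, a):
    """Solve (8.2) for u_1, ..., u_{a-1}; step_probs maps each step k to p_k."""
    n = a - 1
    # augmented matrix of the system  u_z - sum_x u_x p_{x-z} = r_z
    A = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for i, z in enumerate(range(1, a)):
        A[i][i] += 1
        for k, pk in step_probs.items():
            x = z + k
            if x <= 0:
                A[i][n] += pk            # jump to a position <= 0: contributes to r_z
            elif x < a:
                A[i][x - 1] -= pk        # jump to an interior point x
            # jumps to positions >= a contribute nothing
    for col in range(n):                 # Gauss-Jordan elimination
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        pivot = A[col][col]
        A[col] = [val / pivot for val in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return {z: A[z - 1][n] for z in range(1, a)}
```

With unit steps, for example `ruin_probabilities({1: Fraction(1, 2), -1: Fraction(1, 2)}, 5)`, the result reduces to the classical values $1 - z/a$.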

For large $a$ it is cumbersome to solve $a - 1$ linear equations directly,
and it is preferable to use the method of particular solutions analogous
to the procedure of section 2. It works whenever the probability
distribution $\{p_k\}$ has relatively few positive terms. Suppose that only
the $p_k$ with $-\nu \le k \le \mu$ are different from zero, so that the largest
possible jumps in the positive and negative directions are $\mu$ and $\nu$,
respectively. The characteristic equation

(8.6)    $\sum p_k s^k = 1$

is equivalent to an algebraic equation of degree $\nu + \mu$. If $s$ is a root
of (8.6), then $u_z = s^z$ is a formal solution of (8.5) for all $z$, but this
solution does not satisfy the boundary conditions (8.4). If (8.6) has
$\mu + \nu$ distinct roots $s_1, s_2, \ldots$, then the linear combination

(8.7)    $u_z = \sum A_k s_k^z$

is again a formal solution of (8.5) for all $z$, but we must adjust the
constants $A_k$ to satisfy the boundary conditions. Now for $0 < z < a$
only values $x$ with $-\nu + 1 \le x \le a + \mu - 1$ appear in (8.5). It
suffices therefore to satisfy the boundary conditions (8.4) for $x = 0, -1,
-2, \ldots, -\nu+1$, and $x = a, a+1, \ldots, a+\mu-1$, so that we have $\mu + \nu$
conditions in all. If $s_k$ is a double root of (8.6), we lose one constant,
but in this case it is easily seen that $u_z = zs_k^z$ is another formal
solution. In every case the $\mu + \nu$ boundary conditions determine the $\mu + \nu$
arbitrary constants.
Example. Suppose that each individual step takes the particle to
one of the four nearest positions, and we let $p_{-2} = p_{-1} = p_1 = p_2 = \frac{1}{4}$.
The characteristic equation (8.6) is $s^{-2} + s^{-1} + s + s^2 = 4$. To solve
it we put $t = s + s^{-1}$; with this substitution our equation becomes
$t^2 + t = 6$, which has the roots $t = 2, -3$. Solving $t = s + s^{-1}$ for $s$,
we find the four roots

(8.8)    $s_1 = s_2 = 1, \qquad s_3 = \frac{-3 + \sqrt{5}}{2} = s_4^{-1}, \qquad s_4 = \frac{-3 - \sqrt{5}}{2} = s_3^{-1}.$

Since $s_1$ is a double root, the general solution of (8.5) in our case is

(8.9)    $u_z = A_1 + A_2 z + A_3 s_3^z + A_4 s_4^z.$

The boundary conditions are $u_0 = u_{-1} = 1$, and $u_a = u_{a+1} = 0$. They
lead to four linear equations for the coefficients $A_i$ and to the final
solution

(8.10)    $u_z = 1 - \frac{z}{a} + \frac{(2z - a)(1 - s_3^a) + a(s_3^z - s_3^a s_4^z)}{a\{(a+2)(1 - s_3^a) - a(s_4 - s_3^{a+1})\}}$

with $s_3$ and $s_4$ given by (8.8).
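
The example can be checked numerically. The sketch below evaluates one closed form obtained by imposing the four boundary conditions $u_0 = u_{-1} = 1$, $u_a = u_{a+1} = 0$ on (8.9), written here in terms of $s_3$ and $s_4 = s_3^{-1}$, and compares it with a direct iterative solution of (8.2). It is an illustration only; the function names and the use of fixed-point iteration are assumptions.

```python
import math

def ruin_by_iteration(a, sweeps=5000):
    """Solve (8.2) for p_{-2} = p_{-1} = p_1 = p_2 = 1/4 by repeated substitution."""
    u = [0.0] * (a + 1)                    # entries u[1..a-1] are the unknowns
    for _ in range(sweeps):
        for z in range(1, a):
            total = 0.0
            for k in (-2, -1, 1, 2):
                x = z + k
                if x <= 0:
                    total += 0.25          # boundary value 1 for x <= 0
                elif x < a:
                    total += 0.25 * u[x]   # interior point; x >= a contributes 0
            u[z] = total
    return u

def ruin_by_formula(z, a):
    """Closed form A_1 + A_2 z + A_3 s_3^z + A_4 s_4^z fitted to the boundary values."""
    s3 = (-3.0 + math.sqrt(5.0)) / 2.0
    s4 = 1.0 / s3
    num = (2 * z - a) * (1.0 - s3 ** a) + a * (s3 ** z - s3 ** a * s4 ** z)
    den = a * ((a + 2) * (1.0 - s3 ** a) - a * (s4 - s3 ** (a + 1)))
    return 1.0 - z / a + num / den
```

For $a = 3$ both approaches give $u_1 = 3/5$, which can also be found by hand from the two equations $u_1 = \frac{1}{2} + \frac{1}{4}u_2$ and $u_2 = \frac{1}{4} + \frac{1}{4}u_1$.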


Numerical Approximations. Usually it is cumbersome to find all the roots,
but rather satisfactory approximations can be obtained in a surprisingly simple
way. Consider first the case where the probability distribution $\{p_k\}$ has mean
zero. Then the characteristic equation (8.6) has a double root at $s = 1$, and
$A + Bz$ is a formal solution of (8.5). Of course, the two constants $A$ and $B$ do not
suffice to satisfy the $\mu + \nu$ boundary conditions (8.4). However, if we determine
$A$ and $B$ so that $A + Bx$ vanishes for $x = a + \mu - 1$ and equals 1 for $x = 0$,
then $A + Bx \ge 1$ for $x \le 0$ and $A + Bx \ge 0$ for $a \le x < a + \mu$, so that $A + Bz$
satisfies the boundary conditions (8.4) with the equality sign replaced by "greater
than or equal to." The difference $A + Bz - u_z$ is therefore a formal solution of
(8.5) with non-negative boundary values, whence $A + Bz - u_z \ge 0$. In like
manner we can get a lower bound for $u_z$ by determining $A$ and $B$ so that $A + Bx$
vanishes for $x = a$ and equals 1 for $x = -\nu + 1$. Hence we have

(8.11)    $\frac{a - z}{a + \nu - 1} \le u_z \le \frac{a + \mu - z - 1}{a + \mu - 1}.$

This estimate is excellent provided $a$ is large as compared to $\mu + \nu$. (Of course,
$u_z \approx 1 - z/a$ is a better approximation but does not give precise bounds.)

Next, consider the general case where the mean of the distribution $\{p_k\}$ is not
zero. The characteristic equation (8.6) has then a simple root at $s = 1$. The left
side of (8.6) approaches $\infty$ as $s \to 0$ and as $s \to \infty$. For positive $s$ the curve
$y = \sum p_k s^k$ is continuous and convex, and since it intersects the line $y = 1$ at $s = 1$,
there exists exactly one more intersection. Therefore, the characteristic equation
(8.6) has exactly two positive roots, 1 and $s_1$. As before, we see that $A + Bs_1^z$
is a formal solution of (8.5), and we can apply our previous argument to this
solution instead of $A + Bz$. We find in this case

(8.12)    $\frac{s_1^a - s_1^z}{s_1^a - s_1^{-\nu+1}} \le u_z \le \frac{s_1^{a+\mu-1} - s_1^z}{s_1^{a+\mu-1} - 1}$

and have the

Theorem. The solution of our ruin problem satisfies the inequalities (8.11) if $\{p_k\}$
has zero mean, and (8.12) otherwise. Here $s_1$ is the unique positive root different from
1 of (8.6), and $\mu$ and $-\nu$ are defined, respectively, as the largest and smallest subscript
for which $p_k \ne 0$.


Let $m = \sum kp_k$ be the expected gain in a single trial (or expected length of a single
step). It is easily seen from (8.6) that $s_1 > 1$ or $s_1 < 1$ according to whether
$m < 0$ or $m > 0$. Letting $a \to \infty$, we conclude from our theorem that in a game
against an infinitely rich adversary the probability of an ultimate ruin is one if and
only if $m \le 0$.
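
The bounds of the theorem are easy to test on a small example with non-zero mean. The sketch below locates the second positive root of $\sum p_k s^k = 1$ by bisection and compares the two-sided estimate $\frac{s_1^a - s_1^z}{s_1^a - s_1^{-\nu+1}} \le u_z \le \frac{s_1^{a+\mu-1} - s_1^z}{s_1^{a+\mu-1} - 1}$ with an iterative solution of (8.2). It is an illustration only; the function names, the bracketing offsets in the bisection, and the example distribution are assumptions.

```python
def other_positive_root(p):
    """Positive root s_1 != 1 of sum p_k s^k = 1, assuming the mean is not zero."""
    f = lambda s: sum(pk * s ** k for k, pk in p.items()) - 1.0
    mean = sum(k * pk for k, pk in p.items())
    if mean > 0:
        lo, hi = 1e-9, 1.0 - 1e-6          # f(lo) > 0 > f(hi): root below 1
    else:
        lo, hi = 1.0 + 1e-6, 2.0           # f(lo) < 0: root above 1
        while f(hi) < 0:
            hi *= 2.0
    for _ in range(200):                   # bisection on the sign change
        mid = 0.5 * (lo + hi)
        if (f(lo) > 0) == (f(mid) > 0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ruin_by_iteration(p, a, sweeps=4000):
    """Solve (8.2) by repeated substitution; p maps each step k to p_k."""
    u = [0.0] * (a + 1)
    for _ in range(sweeps):
        for z in range(1, a):
            total = 0.0
            for k, pk in p.items():
                x = z + k
                if x <= 0:
                    total += pk            # ruin at this step
                elif x < a:
                    total += pk * u[x]
            u[z] = total
    return u

def bounds(z, a, s1, mu, nu):
    """Lower and upper estimates for u_z built from A + B s_1^z."""
    lower = (s1 ** a - s1 ** z) / (s1 ** a - s1 ** (1 - nu))
    upper = (s1 ** (a + mu - 1) - s1 ** z) / (s1 ** (a + mu - 1) - 1.0)
    return lower, upper
```

For the hypothetical distribution `p = {-1: 0.5, 1: 0.2, 2: 0.3}` (so $\nu = 1$, $\mu = 2$, $m = 0.3$) the characteristic equation factors as $(s - 1)(0.3s^2 + 0.5s - 0.5) = 0$, which gives $s_1 = (-0.5 + \sqrt{0.85})/0.6 \approx 0.7033 < 1$, in agreement with $m > 0$.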

The duration of game can be discussed by similar methods (cf. problem 4). 


9. PROBLEMS FOR SOLUTION 


1. Consider the ruin problem of sections 2 and 3 for the case of a modified
random walk in which the particle moves a unit step to the right or left, or stays
at its present position with probabilities $\alpha$, $\beta$, $\gamma$, respectively ($\alpha + \beta + \gamma = 1$).
(In gambling terminology, the bet may result in a tie.)

2. Consider the ruin problem of sections 2 and 3 for the case where the origin 
is an elastic barrier (as defined in section 1). The difference equations for the 
probability of ruin (absorption at the origin) and for the expected duration 
are the same, but with new boundary conditions. 

3. A particle moves at each step two units to the right or one unit to the left,
with corresponding probabilities $p$ and $q$ ($p + q = 1$). If the starting position
is $z > 0$, find the probability that the particle will ever reach the origin. (This
is the ruin problem against an infinitely rich adversary.)

Hint: The equation corresponding to (2.1) has the particular solution $q_z = 1$
and two particular solutions of the form $\lambda^z$, where $\lambda$ satisfies a quadratic
equation.

4. In the generalized random-walk problem of section 8 put [in analogy with
(8.1)] $\rho_z = p_{a-z} + p_{a+1-z} + p_{a+2-z} + \cdots$, and let $d_{z,n}$ be the probability
that the game lasts for exactly $n$ steps. Show that for $n \ge 1$

    $d_{z,n+1} = \sum_{x=1}^{a-1} d_{x,n}\,p_{x-z}$

with $d_{z,1} = r_z + \rho_z$. Hence prove that the generating function $d_z(s) = \sum d_{z,n}s^n$
is the solution of the system of linear equations

    $s^{-1}d_z(s) - \sum_{x=1}^{a-1} d_x(s)\,p_{x-z} = r_z + \rho_z.$

By differentiation it follows that the expected duration $e_z$ is the solution of

    $e_z - \sum_{x=1}^{a-1} e_x\,p_{x-z} = 1.$
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5. In the random walk with absorbing barriers at the points 0 and a and 
with initial position z, let w,,,(z) be the probability that the nth step takes 
the particle to the position z. Find the difference equations and boundary 
conditions which determine w,,,(z). 


6. Continuation. Modify the boundary conditions for the case of two 
reflecting barriers (i.e., elastic barriers with ὃ = 1), 


Note: In the following problems vz, 18 the probability (6.1) that in an unrestricted 
random walk starting at the origin the nth step takes the particle to the position zx. 


᾿ 7 Method of images.2 Let p = ᾳ = 4. In a random walk in (0, ©) with 
an absorbing barrier at the origin and initial position at 2, let uzn(x) be the 
probability that the nth step takes the particle to the position z. Show that 
Uzn(Z) = τῳ. χα — Uz+e,n- (Hint: Show that a difference equation corre- 
sponding to (4.1) and the appropriate boundary conditions are satisfied.) 


8. Continuation. If the origin is a reflecting barrier, then 
$u_{z,n}(x) = v_{x-z,n} + v_{x+z,n}$. 


9. Continuation. If the random walk is restricted to (0, a) and both 
barriers are absorbing, then 

$$u_{z,n}(x) = \sum_k \{ v_{x-z-2ka,n} - v_{x+z-2ka,n} \}, \tag{9.1}$$

the summation extending over all k, positive or negative (only finitely many 
terms are different from zero). If both barriers are reflecting, equation (9.1) 
holds with minus replaced by plus. 


10. Distribution of maxima. In a symmetric unrestricted random walk 
starting at the origin let $M_n$ be the maximum abscissa of the particle at times 
0, 1, 2, ..., n. Using the formula of problem 7, show that 

$$P\{M_n = z\} = v_{z,n} + v_{z+1,n}. \tag{9.2}$$


11. Let $V_x(s) = \sum_n v_{x,n} s^n$ (cf. the note preceding problem 7). Prove that 
$V_x(s) = V_0(s)\lambda_2^{-x}(s)$ when x ≤ 0 and $V_x(s) = V_0(s)\lambda_1^{-x}(s)$ when x ≥ 0, where 
$\lambda_1(s)$ and $\lambda_2(s)$ are defined in (4.8). Moreover, $V_0(s) = (1 - 4pqs^2)^{-1/2}$. 

Note. These relations follow directly from the fact that $\lambda_1(s)$ and $\lambda_2(s)$ are 
generating functions of first-passage times as explained at the conclusion of 
section 4. 


12. In a random walk in (0, ∞) with an absorbing barrier at the origin and 
initial position at z, let $u_{z,n}(x)$ be the probability that the nth step takes the 


¹² Problems 7-9 are examples of the method of images. The term $v_{x-z,n}$ 
corresponds to a particle in an unrestricted random walk, and $v_{x+z,n}$ to an "image 
point." In equation (9.1) we find image points starting from various positions, 
obtained by repeated reflections at both boundaries. In problems 12 and 13 we 
get the general result for the unsymmetric random walk using generating functions. 
In the theory of difference equations the method of images is always ascribed to 
Lord Kelvin. The equivalent reflection principle is generally attributed to D. 
André. See footnote 4 of chapter III. 
particle to the position x, and let 

$$U_z(s; x) = \sum_{n=0}^{\infty} u_{z,n}(x)\, s^n. \tag{9.3}$$

Using problem 11, show that $U_z(s; x) = V_{x-z}(s) - \lambda_2^{z}(s)V_x(s)$. Conclude 

$$u_{z,n}(x) = v_{x-z,n} - (q/p)^z\, v_{x+z,n}. \tag{9.4}$$

Compare with the result of problem 7 and derive (9.4) from the latter by 
combinatorial methods. 

13. Alternative formula for the probability of ruin (5.7). Expanding (4.11) 
into a geometric series, prove that 

$$u_{z,n} = \sum_{k=0}^{\infty} \left(\frac{p}{q}\right)^{ka} w_{z+2ka,n} - \sum_{k=1}^{\infty} \left(\frac{p}{q}\right)^{ka-z} w_{2ka-z,n}$$

with $w_{z,n}$ defined in (5.9). 


14. If the passage to the limit of section 6 is applied to the expression for 
$u_{z,n}$ given in the preceding problem, show that the probability of absorption 
during a short time interval of length $\Delta t$ is asymptotically 


= = δι(π Ὁ ἢ —tp—e(et+z)/D os (2+ Qha)e te +2ka)?/De 
(Hint: Apply the normal approximation to the binomial distribution.) 

15.¹⁴ Renewal method for the ruin problem. In the random walk with two 
absorbing barriers let $u_{z,n}$ and $u^*_{z,n}$ be, respectively, the probabilities of 
absorption at the left and the right barriers. By a proper interpretation prove 
the truth of the following two equations: 

$$V_{-z}(s) = U_z(s)V_0(s) + U_z^*(s)V_{-a}(s),$$
$$V_{a-z}(s) = U_z(s)V_a(s) + U_z^*(s)V_0(s).$$

By solving this system for $U_z(s)$, derive (4.11). 

16. Let $u_{z,n}(x)$ be the probability that the particle, starting from z, will at 
the nth step be at x without having previously touched the absorbing barriers. 
Using the notations of problem 15, show that for the corresponding generating 
function $U_z(s; x) = \sum_n u_{z,n}(x) s^n$ we have 

$$U_z(s; x) = V_{x-z}(s) - U_z(s)V_x(s) - U_z^*(s)V_{x-a}(s).$$

(No calculations are required.) 

17. Continuation. The generating function $U_z(s; x)$ of the preceding problem 
can be obtained by putting $U_z(s; x) = V_{x-z}(s) - A\lambda_1^z(s) - B\lambda_2^z(s)$ and 
determining the constants so that the boundary conditions $U_z(s; x) = 0$ for 
z = 0 and z = a are satisfied. With reflecting barriers the boundary conditions 
are $U_0(s; x) = U_1(s; x)$ and $U_a(s; x) = U_{a-1}(s; x)$. 


¹³ The agreement of the new formula with the limiting form (6.10) is a well-known 
fact of the theory of theta functions. 

¹⁴ Problems 15-17 contain a new and independent derivation of the main results 
concerning random walks in one dimension. 
18.¹⁵ A symmetric unrestricted random walk starts at the origin. The 
probability that the rth return to the origin occurs at the nth step equals the 
probability that the first passage through r occurs at the (n−r)th step. 
(Hint: Compare the generating functions.) 

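19. Prove the formula 

$$v_{x,n} = (2\pi)^{-1}\, 2^n\, p^{(n+x)/2} q^{(n-x)/2} \int_{-\pi}^{\pi} \cos^n t \cdot \cos tx \; dt$$

by showing that the appropriate difference equation is satisfied. Conclude that 

$$V_x(s) = (2\pi)^{-1} \left(\frac{p}{q}\right)^{x/2} \int_{-\pi}^{\pi} \frac{\cos tx}{1 - 2(pq)^{1/2} \cdot s \cdot \cos t}\; dt.$$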


20. In a three-dimensional symmetric random walk the particle has probability 
one to pass infinitely often through any particular line x = m, y = n. 
(Hint: Cf. problem 1.) 

21. In a two-dimensional symmetric random walk starting at the origin 
the probability that the nth step takes the particle to (x, y) is 

$$(2\pi)^{-2}\, 2^{-n} \int_{-\pi}^{\pi}\!\int_{-\pi}^{\pi} (\cos\alpha + \cos\beta)^n \cdot \cos x\alpha \cdot \cos y\beta \; d\alpha\, d\beta.$$

Verify this formula and find the analogue for three dimensions. (Hint: Check 
that the expression satisfies the proper difference equation.) 

22. In a two-dimensional symmetric random walk let $D_n^2 = x^2 + y^2$ be 
the square of the distance of the particle from the origin at time n. Prove 
$E(D_n^2) = n$. [Hint: Calculate $E(D_{n+1}^2 - D_n^2)$.] 

23. In a symmetric random walk in d dimensions the particle has probability 
1 to return infinitely often to a position already previously occupied. (Hint: 
At each step the probability of moving to a new position is at most 
(2d − 1)/(2d).) 


¹⁵ This is theorem 3 of chapter III, section 4. 


CHAPTER XV 


Markov Chains 


1. DEFINITION 


Up to now we have been concerned mostly with independent trials 
which can be described as follows. A set of possible outcomes $E_1, E_2, \ldots$ 
(finite or infinite in number) is given, and with each there is 
associated a probability $p_k$; the probabilities of sample sequences are 
defined by the multiplicative property 
$P\{(E_{j_0}, E_{j_1}, \ldots, E_{j_n})\} = p_{j_0} p_{j_1} \cdots p_{j_n}$. 
In the theory of Markov¹ chains we consider the 
simplest generalization which consists in permitting the outcome of any 
trial to depend on the outcome of the directly preceding trial (and only 
on it). The outcome $E_k$ is no longer associated with a fixed probability 
$p_k$, but to every pair $(E_j, E_k)$ there corresponds a conditional probability 
$p_{jk}$; given that $E_j$ has occurred at some trial, the probability of $E_k$ at 
the next trial is $p_{jk}$. In addition to the $p_{jk}$ we must be given the 
probability $a_k$ of the outcome $E_k$ at the initial trial. For $p_{jk}$ to have the 
meaning attributed to them, the probabilities of sample sequences 
corresponding to two, three, or four trials must be defined by 

$$P\{(E_j, E_k)\} = a_j p_{jk}, \qquad P\{(E_j, E_k, E_r)\} = a_j p_{jk} p_{kr},$$
$$P\{(E_j, E_k, E_r, E_s)\} = a_j p_{jk} p_{kr} p_{rs},$$

and generally 

$$P\{(E_{j_0}, E_{j_1}, \ldots, E_{j_n})\} = a_{j_0}\, p_{j_0 j_1}\, p_{j_1 j_2} \cdots p_{j_{n-2} j_{n-1}}\, p_{j_{n-1} j_n}. \tag{1.1}$$


Here the initial trial is numbered zero, so that trial number one is the 
second trial. (This convention is convenient and has been introduced 
tacitly in the preceding chapter.) 
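
The product rule (1.1) is easy to mechanize. The Python sketch below is an illustration only: the names `path_probability` and `total_probability` are hypothetical, states are indexed from 0 for convenience, and exact fractions are used so that sums can be compared without rounding.

```python
from fractions import Fraction
from itertools import product

def path_probability(a, P, path):
    """Probability (1.1) of the sample sequence `path` (a list of state indices)."""
    prob = a[path[0]]                   # initial probability a_{j0}
    for j, k in zip(path, path[1:]):    # one factor p_{jk} per transition
        prob *= P[j][k]
    return prob

def total_probability(a, P, n):
    """Sum of (1.1) over all sample sequences of n + 1 trials."""
    states = range(len(a))
    return sum(path_probability(a, P, path) for path in product(states, repeat=n + 1))

# A two-state illustration (the numbers are arbitrary assumptions).
a = [Fraction(1, 2), Fraction(1, 2)]
P = [[Fraction(3, 4), Fraction(1, 4)],
     [Fraction(1, 3), Fraction(2, 3)]]
```

Here `path_probability(a, P, [0, 0, 1])` is the product of 1/2, 3/4, and 1/4, and `total_probability(a, P, n)` adds the numbers (1.1) over all sequences of n + 1 trials.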


Examples. (a) Every Markov chain is equivalent to an urn model 
as follows. Each occurring subscript is represented by an urn, and 
each urn contains balls marked $E_1, E_2, \ldots$. The composition of the 
urns remains fixed, but it varies from urn to urn; in the jth urn the 
probability to draw a ball marked $E_k$ is $p_{jk}$. At the initial, or zero-th, 
trial an urn is chosen in accordance with the probability distribution 
$\{a_j\}$. From that urn a ball is drawn at random, and if it is marked $E_j$, 
the next drawing is made from the jth urn, etc. Obviously with this 
procedure the probability of a sequence $(E_{j_0}, \ldots, E_{j_n})$ is given by (1.1). 
We see that the notion of a Markov chain is not more general than urn 
models, but the new symbolism will prove more practical and more 
intuitive. 

(b) Independent trials are, of course, the special case of our scheme 
with $p_{jk} = a_k$ for each j. 

¹ A. A. Markov (1856-1922). 


If $a_k$ is the probability of $E_k$ at the initial (or zero-th) trial, we must 
have $a_k \ge 0$ and $\sum a_k = 1$. Moreover, whenever $E_j$ occurs it must be 
followed by some $E_k$, and it is therefore necessary that for all j and k 

$$p_{j1} + p_{j2} + p_{j3} + \cdots = 1, \qquad p_{jk} \ge 0. \tag{1.2}$$

We want to show that for any numbers $a_k$ and $p_{jk}$ satisfying these 
conditions, the assignment (1.1) is a permissible definition of probabilities 
in the sample space corresponding to n + 1 trials. The numbers 
defined in (1.1) being non-negative, we need only prove that they add 
to unity. Fix first $j_0, j_1, \ldots, j_{n-1}$ and add the numbers (1.1) for all 
possible $j_n$. Using (1.2) with $j = j_{n-1}$, we see immediately that the 
sum equals $a_{j_0} p_{j_0 j_1} \cdots p_{j_{n-2} j_{n-1}}$. Thus the sum over all numbers (1.1) 
does not depend on n, and since $\sum a_{j_0} = 1$, the sum equals unity for all n. 

The definition (1.1) depends formally on the number of trials, but 
our argument proves the mutual consistency of the definitions (1.1) 
for all n. For example, to obtain the probability of the event "the 
first two trials result in $(E_j, E_k)$," we have to fix $j_0 = j$ and $j_1 = k$, and 
add the probabilities (1.1) for all possible $j_2, j_3, \ldots, j_n$. We have just 
shown that the sum is $a_j p_{jk}$ and is thus independent of n. This means 
that it is usually not necessary explicitly to refer to the number of 
trials; the event $(E_{j_0}, \ldots, E_{j_r})$ has the same probability in all sample 
spaces of more than r trials. In connection with independent trials 
it has been pointed out repeatedly that, from a mathematical point of 
view, it is most satisfactory to introduce only the unique sample space 
of unending sequences of trials and to consider the result of finitely 
many trials as the beginning of an infinite sequence. This statement 
holds true also for Markov chains. Unfortunately, sample spaces of 
infinitely many trials lead beyond the theory of discrete probabilities 
to which we are restricted in the present volume. 




To summarize, our starting point is the following 


Definition. A sequence of trials with possible outcomes $E_1, E_2, \ldots$ 
will be called a Markov chain² if the probabilities of sample sequences are 
defined by (1.1) in terms of an initial probability distribution $\{a_k\}$ for 
the states $E_k$ at time 0 and fixed conditional probabilities $p_{jk}$ of $E_k$ given 
that $E_j$ has occurred at the preceding trial. 


We shall now modify our terminology to conform to the usage in 
physical applications. Instead of saying "the nth trial results in $E_k$," 
we shall say that at time n the system is in state $E_k$. The conditional 
probability $p_{jk}$ will be called the probability of the transition $E_j \to E_k$ 
(from state $E_j$ to state $E_k$). 

The transition probabilities $p_{jk}$ will be arranged in a matrix of 
transition probabilities 

$$P = \begin{pmatrix} p_{11} & p_{12} & p_{13} & \cdots \\ p_{21} & p_{22} & p_{23} & \cdots \\ p_{31} & p_{32} & p_{33} & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix} \tag{1.3}$$

where the first subscript stands for row, the second for column. Clearly 
P is a square matrix with non-negative elements and unit row sums. 
Such a matrix (finite or infinite) is called a stochastic matrix. Any 
stochastic matrix can serve as a matrix of transition probabilities; together 
with our initial distribution $\{a_k\}$ it completely defines a Markov chain 
with states $E_1, E_2, \ldots$. 

In some special cases it is convenient to number the states starting 
with 0 rather than with 1. A zero row and zero column are then to 
be added to P. 
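
For a finite chain the two defining properties of a stochastic matrix can be tested directly, and a sample sequence can be drawn trial by trial. The Python sketch below is illustrative only: `is_stochastic` and `simulate` are hypothetical names, states are numbered from 0, and a small tolerance is assumed for row sums given in floating point.

```python
import random

def is_stochastic(P, tol=1e-12):
    """True if P is square with non-negative entries and unit row sums."""
    n = len(P)
    return all(
        len(row) == n
        and all(x >= 0 for x in row)
        and abs(sum(row) - 1) <= tol
        for row in P
    )

def simulate(a, P, steps, rng):
    """Draw the states at times 0, 1, ..., steps of the chain defined by (a, P)."""
    states = range(len(P))
    state = rng.choices(states, weights=a)[0]               # initial trial, distribution {a_k}
    path = [state]
    for _ in range(steps):
        state = rng.choices(states, weights=P[state])[0]    # row `state` of P governs the next trial
        path.append(state)
    return path
```

For instance, `simulate([0.5, 0.5], [[0.9, 0.1], [0.4, 0.6]], 20, random.Random(1))` returns a list of 21 state indices; the particular sequence depends on the seed.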


2. ILLUSTRATIVE EXAMPLES 


This section contains examples which will familiarize the reader with 
the notion of a Markov chain. To save space we shall refer to some 
of them as the occasion arises, but the reader is advised not to store 


² This is not the standard terminology. We are here considering only a special 
class of Markov chains, and, strictly speaking, here and in the following sections 
the term Markov chain should always be qualified by adding the clause “with 
constant transition probabilities.” Actually, the general type of Markov chain 
is rarely studied. It will be defined in section 10, where the Markov property will 
be discussed in relation to general stochastic processes. There the reader will also 
find examples of dependent trials that do not form Markov chains. 




the examples in his mind. For the classical example of card shuffling 
see section 9. 

(a) Suppose that there are only two states $E_1$, $E_2$, also called "success" 
and "failure." The matrix P is of the form 

$$P = \begin{pmatrix} p & q \\ p' & q' \end{pmatrix}, \qquad p + q = p' + q' = 1,$$

and p, p' are the probabilities of success following success, and success 
following failure, respectively. For a particular example, imagine 
a ball moving with velocity ±1 in the direction of the x-axis. At times 
1, 2, ... the ball reverses its direction with probability q, and keeps it 
with probability p. If $E_1$ stands for velocity +1 and $E_2$ for −1, the 
matrix of transition probabilities is of the form described with q' = p 
and p' = q. (This experiment could be simulated by means of a large 
regular pegboard.) 

(b) Random walk with absorbing barriers. Let the possible states be 
$E_0, E_1, \ldots, E_a$ and consider the matrix of transition probabilities 

$$P = \begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\
q & 0 & p & 0 & \cdots & 0 & 0 & 0 \\
0 & q & 0 & p & \cdots & 0 & 0 & 0 \\
\vdots & & & & & & & \vdots \\
0 & 0 & 0 & 0 & \cdots & q & 0 & p \\
0 & 0 & 0 & 0 & \cdots & 0 & 0 & 1
\end{pmatrix}.$$

From each of the "interior" states $E_1, \ldots, E_{a-1}$ transitions are possible 
to the right and the left neighbors (with $p_{j,j+1} = p$ and $p_{j,j-1} = q$). 
However, no transition is possible from either $E_0$ or $E_a$ to any other 
state; the system may move from one state to another, but once $E_0$ or 
$E_a$ is reached, the system stays there fixed forever. Clearly this 
Markov chain differs only terminologically from the model of a random 
walk with absorbing barriers at 0 and a discussed in the last 
chapter. There the random walk started from a fixed point z of the 
interval. In Markov chain terminology this amounts to choosing the 
initial distribution so that $a_z = 1$ (and hence $a_x = 0$ for $x \ne z$). If we 
had chosen the initial state at random we would have $a_k = (a + 1)^{-1}$ 
for k = 0, 1, ..., a. 

(c) Elastic barriers. We next consider a matrix which differs from 
the preceding one only in the rows number 1 and a − 1. Choose 
$0 \le \delta_0 \le 1$ and $0 \le \delta_a \le 1$ and set 

$$P = \begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\
(1-\delta_0)q & \delta_0 q & p & 0 & \cdots & 0 & 0 & 0 \\
0 & q & 0 & p & \cdots & 0 & 0 & 0 \\
\vdots & & & & & & & \vdots \\
0 & 0 & 0 & 0 & \cdots & 0 & p & 0 \\
0 & 0 & 0 & 0 & \cdots & q & \delta_a p & (1-\delta_a)p \\
0 & 0 & 0 & 0 & \cdots & 0 & 0 & 1
\end{pmatrix}.$$

The transition probabilities are the same as before except that from 
$E_1$ a passage to $E_0$ has only probability $(1 - \delta_0)q$, and with probability 
$\delta_0 q$ the system stays at $E_1$; a similar statement holds for $E_{a-1}$. For 
$\delta_0 = \delta_a = 0$ our matrix is identical with the preceding one. When 
$\delta_0 = \delta_a = 1$, no passage into $E_0$ and $E_a$ is possible; a system starting 
at an interior state $E_j$ will move from state to state but never enter $E_0$ 
or $E_a$. In random-walk terminology this last situation corresponds to 
reflecting barriers (cf. chapter XIV). In betting language the state of 
the system represents the capital of a player in a game where the two 
players own between them the amount a. Each time the first player 
loses his last dollar, the adversary replaces it with probability $\delta_0$, and 
with probability $1 - \delta_0$ the game terminates. With two reflecting 
barriers the game never terminates. 
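
The matrices of examples (b) and (c) can be generated by one routine. The following Python sketch is an illustration, not part of the text: the name `elastic_walk_matrix` and its parameters are hypothetical, and a ≥ 3 is assumed so that rows 1 and a − 1 are distinct.

```python
from fractions import Fraction

def elastic_walk_matrix(a, p, delta0=0, delta_a=0):
    """Transition matrix on states 0..a; delta0 = delta_a = 0 gives absorbing barriers."""
    if a < 3:
        raise ValueError("this sketch assumes a >= 3")
    q = 1 - p
    P = [[0] * (a + 1) for _ in range(a + 1)]
    P[0][0] = P[a][a] = 1                      # E_0 and E_a are never left
    for j in range(1, a):                      # interior states: step left with q, right with p
        P[j][j - 1] = q
        P[j][j + 1] = p
    # Elastic modification of rows 1 and a - 1, as in example (c).
    P[1][0], P[1][1] = (1 - delta0) * q, delta0 * q
    P[a - 1][a - 1], P[a - 1][a] = delta_a * p, (1 - delta_a) * p
    return P
```

With `delta0 = delta_a = 1` the first and last columns of the interior rows vanish, which is the reflecting case described above.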

(d) Cyclical random walks. Again let the possible states be $E_1, E_2, 
\ldots, E_a$ but order them cyclically so that $E_a$ has the neighbors $E_{a-1}$ 
and $E_1$. If, as before, the system always passes either to the right 
or to the left neighbor, the rows of the matrix P are as in example 
(b), except that the first row is (0, p, 0, 0, ..., 0, q) and the last 
(p, 0, 0, 0, ..., 0, q, 0). 

More generally, we may permit transitions between any two states. 
Let $q_0, q_1, \ldots, q_{a-1}$ be, respectively, the probability of staying fixed 
or moving 1, 2, ..., a−1 units to the right (where k units to the right 
is the same as a − k units to the left). Then P is the cyclical matrix 

$$P = \begin{pmatrix}
q_0 & q_1 & q_2 & \cdots & q_{a-2} & q_{a-1} \\
q_{a-1} & q_0 & q_1 & \cdots & q_{a-3} & q_{a-2} \\
q_{a-2} & q_{a-1} & q_0 & \cdots & q_{a-4} & q_{a-3} \\
\vdots & & & & & \vdots \\
q_1 & q_2 & q_3 & \cdots & q_{a-1} & q_0
\end{pmatrix}.$$

If $q_1 = p$, $q_{a-1} = q$, and $q_k = 0$ for 1 < k < a − 1, then this random 
walk reduces to the simple case discussed at the beginning of this 
example. [The discussion is continued in example XVI (2.d).] 

(e) Unrestricted random walks. An unrestricted one-dimensional 
random walk is a Markov chain, but it is most natural to order the 
states in a doubly infinite sequence $(\ldots, E_{-2}, E_{-1}, E_0, E_1, E_2, \ldots)$. 
In order to write the matrix of transition probabilities in the familiar 
form, we must rearrange the states. For example, for the ordering 
$(E_0, E_1, E_{-1}, E_2, E_{-2}, \ldots)$ the first row of P becomes (0, p, q, 0, 0, ...), 
the second (q, 0, 0, p, 0,0, ...), etc. Unfortunately, the natural sym- 
metry is lost, and the formulas become unpleasant. The situation 
grows even worse in two dimensions. In such cases the methods of 
this chapter are not convenient for deriving explicit formulas, but the 
general theorems apply and contain pertinent information. 

(f) The Ehrenfest model of diffusion. Once more we consider a 
chain with the a + 1 states $E_0, E_1, \ldots, E_a$ and transitions possible 
only to the right and to the left neighbor; however, this time we put 
$p_{j,j+1} = 1 - j/a$ and $p_{j,j-1} = j/a$, so that 

$$P = \begin{pmatrix}
0 & 1 & 0 & 0 & \cdots & 0 & 0 \\
a^{-1} & 0 & 1 - a^{-1} & 0 & \cdots & 0 & 0 \\
0 & 2a^{-1} & 0 & 1 - 2a^{-1} & \cdots & 0 & 0 \\
\vdots & & & & & & \vdots \\
0 & 0 & 0 & 0 & \cdots & 0 & a^{-1} \\
0 & 0 & 0 & 0 & \cdots & 1 & 0
\end{pmatrix}.$$

This chain has two interesting physical interpretations. For a discussion 
of various recurrence problems in statistical mechanics P. and 
T. Ehrenfest³ described a conceptual experiment where a molecules 
are distributed in two containers A and B. At time n a molecule is 
chosen at random and removed from its container to the other. The 
state of the system is determined by the number of molecules in A. 
Suppose that at a certain moment there are exactly k molecules in the 


³ P. and T. Ehrenfest, Über zwei bekannte Einwände gegen das Boltzmannsche 
H-Theorem, Physikalische Zeitschrift, vol. 8 (1907), pp. 311-314. Ming Chen 
Wang and G. E. Uhlenbeck, On the theory of the Brownian motion II, Reviews of 
Modern Physics, vol. 17 (1945), pp. 323-342. For a more complete discussion (by 
methods essentially equivalent to those of chapter XVI) see M. Kac, Random 
walk and the theory of Brownian motion, American Mathematical Monthly, vol. 54 
(1947), pp. 369-391. See also B. Friedman, A simple urn model, Communications 
on Pure and Applied Mathematics, vol. 2 (1949), pp. 59-70. 
container A. At the next trial the system passes into $E_{k-1}$ or $E_{k+1}$ 
according to whether a molecule in A or B is chosen; the corresponding 
probabilities are k/a and (a − k)/a, and therefore our chain describes 
Ehrenfest's experiment. However, our chain can also be interpreted 
as diffusion with a central force, that is, a random walk in which the 
probability of a step to the right varies with the position. From x = j 
the particle is more likely to move to the right or to the left according 
as j < a/2 or j > a/2; this means that the particle has a tendency to 
move toward x = a/2, which corresponds to an attractive elastic force 
increasing in direct proportion to the distance. (The Ehrenfest model 
has been described in example V(2.c); see also example (6.a) and 
problem 12.) 
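
The central-force interpretation can be read off from the expected displacement in one step. The Python sketch below is illustrative only; `ehrenfest_matrix` and `drift` are hypothetical names, and exact fractions are used.

```python
from fractions import Fraction

def ehrenfest_matrix(a):
    """Transition matrix with p_{j,j+1} = 1 - j/a and p_{j,j-1} = j/a on states 0..a."""
    P = [[Fraction(0)] * (a + 1) for _ in range(a + 1)]
    for j in range(a + 1):
        if j < a:
            P[j][j + 1] = Fraction(a - j, a)
        if j > 0:
            P[j][j - 1] = Fraction(j, a)
    return P

def drift(P, j):
    """Expected displacement in one step when the system is in state j."""
    return sum((k - j) * pr for k, pr in enumerate(P[j]))
```

For this matrix `drift(P, j)` equals 1 − 2j/a: it is positive for j < a/2, zero at j = a/2, and negative for j > a/2, in proportion to the distance from a/2.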

(g) Occupancy problems. In chapter I we considered random placements 
of balls into a cells. Let the number of occupied cells determine 
the state of the system. If j cells are occupied, the probability that 
the next ball is placed into an empty cell is (a − j)/a. Hence the 
experiment is described by a chain with transition probabilities 
$p_{jj} = j/a$, $p_{j,j+1} = (a - j)/a$, and $p_{jk} = 0$ for all other combinations 
of j and k. The initial distribution (all cells empty) is given by $p_0 = 1$, 
$p_k = 0$ for 1 ≤ k ≤ a. [Cf. example XVI(2.e).] 

(h) Success runs. In a sequence of Bernoulli trials we agree to say 
that at time n we observe the state $E_0$ if the nth trial results in failure, 
and the state $E_k$ (k = 1, 2, ..., n) if the last failure occurred at trial 
number n − k (the zero-th trial counting as failure). In other words, 
the index k of the state $E_k$ indicates the length of the uninterrupted 
sequence of successes ending at the nth trial. It is obvious that we are 
dealing with a Markov chain in which only the transitions $E_k \to E_0$ 
and $E_k \to E_{k+1}$ are possible, and the matrix of transition probabilities 
takes on the form 

$$P = \begin{pmatrix}
q & p & 0 & 0 & 0 & \cdots \\
q & 0 & p & 0 & 0 & \cdots \\
q & 0 & 0 & p & 0 & \cdots \\
\vdots & & & & &
\end{pmatrix}.$$

(i) Recurrent events. The example above is a special case of a more 
interesting Markov chain. Let $\mathcal{E}$ be an arbitrary recurrent event with 
the distribution of recurrence times given by $\{f_k\}$. Conventionally 
we say that at the zero-th trial $\mathcal{E}$ did occur. We say that at time n 
the system is in state $E_0$ if $\mathcal{E}$ occurs at the nth trial, and in state $E_k$ if 
the last occurrence of $\mathcal{E}$ took place at trial number n − k. (In a manner 
of speaking, we are dealing with the waiting time in the negative 
direction.) As in the last example, it is clear that the state $E_k$ at the 
nth trial can be succeeded only by $E_0$ (if $\mathcal{E}$ occurs) or by $E_{k+1}$. Put 

$$s_k = f_1 + \cdots + f_k, \qquad q_k = \frac{f_{k+1}}{1 - s_k}, \qquad p_k = 1 - q_k = \frac{1 - s_{k+1}}{1 - s_k}. \tag{2.1}$$

Observing $E_k$ means that the waiting time for $\mathcal{E}$ exceeds k, and the 
probability of $\mathcal{E}$ occurring at the next trial under this hypothesis equals 
$q_k$. Accordingly the transitions $E_k \to E_{k+1}$ and $E_k \to E_0$ have 
probabilities $p_k$ and $q_k$, respectively. A typical sample sequence is of the 
form $E_0E_0E_1E_2E_3E_0E_1E_0E_1E_2E_0E_0$ (the first $E_0$ representing the 
zero-th trial). Here the waiting times are successively 1, 4, 2, 3, 1, 
and the probability of our sequence equals $f_1 f_4 f_2 f_3 f_1$. Now 

$$f_1 f_4 f_2 f_3 f_1 = q_0 \cdot p_0 p_1 p_2 q_3 \cdot p_0 q_1 \cdot p_0 p_1 q_2 \cdot q_0 \tag{2.2}$$

in accordance with the rule (1.1) for probabilities in Markov chains. 
This reasoning applies to all sequences, and we see that the process is 
a Markov chain with the matrix 

$$P = \begin{pmatrix}
q_0 & p_0 & 0 & 0 & 0 & \cdots \\
q_1 & 0 & p_1 & 0 & 0 & \cdots \\
q_2 & 0 & 0 & p_2 & 0 & \cdots \\
q_3 & 0 & 0 & 0 & p_3 & \cdots \\
\vdots & & & & &
\end{pmatrix}.$$

[Continued in example (6.c).] 
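
The identity (2.2) can be verified mechanically for any particular distribution of recurrence times. The Python sketch below is an illustration under assumptions: the names `transition_probs` and `sequence_probability` are hypothetical, and the distribution `f` is an arbitrary example with finitely many non-zero terms, chosen only so that the arithmetic is exact.

```python
from fractions import Fraction

def transition_probs(f):
    """Given f = [f_1, f_2, ...], return lists p, q with q[k] = f_{k+1} / (1 - s_k) as in (2.1)."""
    p, q = [], []
    s = Fraction(0)                  # s_k = f_1 + ... + f_k, starting with s_0 = 0
    for f_next in f:                 # f_next plays the role of f_{k+1}
        if s == 1:                   # no further waiting time is possible
            break
        qk = f_next / (1 - s)
        q.append(qk)
        p.append(1 - qk)
        s += f_next
    return p, q

def sequence_probability(path, p, q):
    """Probability (1.1) of a sequence of states that starts in E_0 at the zero-th trial."""
    prob = Fraction(1)
    for j, k in zip(path, path[1:]):
        if k == 0:
            prob *= q[j]             # transition E_j -> E_0
        elif k == j + 1:
            prob *= p[j]             # transition E_j -> E_{j+1}
        else:
            return Fraction(0)       # no other transition is possible
    return prob

# Hypothetical recurrence-time distribution f_1, ..., f_5 (sums to 1).
f = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8), Fraction(1, 4)]
p, q = transition_probs(f)
path = [0, 0, 1, 2, 3, 0, 1, 0, 1, 2, 0, 0]     # the sample sequence of the text
```

For this choice `sequence_probability(path, p, q)` coincides with the product `f[0] * f[3] * f[1] * f[2] * f[0]`, that is, with $f_1 f_4 f_2 f_3 f_1$.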

(j) Sequential sampling. As we have seen in chapter XIV, section 
8, the following problem occurs in sequential sampling. Let 
$S_n = X_1 + \cdots + X_n$, where the $X_k$ are mutually independent random variables 
assuming only integral values and having a common distribution $\{p_k\}$, 
k = 0, ±1, ±2, .... For preassigned z > 0, b > 0 there exists a 
smallest n for which either $S_n \ge b$ or $S_n \le -z$. This n is, of course, a 
random variable, and we are interested in its distribution and in the 
probabilities of the two contingencies $S_n \le -z$ and $S_n \ge b$. 

The problem can be formulated in terms of a Markov chain with 
states 0, 1, 2, ..., b + z as follows. Let a = b + z − 1 and choose z 
for the initial state. We say that at time n the system is in the state 
x (where x = 1, 2, ..., a) if $z + S_n = x$ provided, however, none of the 
sums $z + S_1, \ldots, z + S_{n-1}$ is ≤ 0 or > a; and with the same proviso we 
say that the system is in state 0 if $z + S_n \le 0$ and in state a + 1 if 
$z + S_n \ge a + 1$. Once the system passes into one of the two limiting 
states 0 and a + 1, it remains there forever (that is, we put 
$p_{0,0} = p_{a+1,a+1} = 1$). The matrix of transition probabilities is 

$$P = \begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
r_1 & p_0 & p_1 & p_2 & \cdots & p_{a-1} & \rho_1 \\
r_2 & p_{-1} & p_0 & p_1 & \cdots & p_{a-2} & \rho_2 \\
r_3 & p_{-2} & p_{-1} & p_0 & \cdots & p_{a-3} & \rho_3 \\
\vdots & & & & & & \vdots \\
r_a & p_{-a+1} & p_{-a+2} & p_{-a+3} & \cdots & p_0 & \rho_a \\
0 & 0 & 0 & 0 & \cdots & 0 & 1
\end{pmatrix}$$

where 

$$r_k = p_{-k} + p_{-k-1} + p_{-k-2} + p_{-k-3} + \cdots$$

and 

$$\rho_k = p_{a-k+1} + p_{a-k+2} + \cdots.$$


As an illustration, take Bartky's double-sampling inspection scheme. 
To test a consignment of items, samples of size N are taken and subjected 
to complete inspection. It is assumed that the samples are 
stochastically independent and that the number of defectives in each 
has the same binomial distribution. Allowance is made for one defective 
item per sample, and so we let $X_k + 1$ equal the number of defectives 
in the kth sample. Then for k ≥ 0 

$$p_k = \binom{N}{k+1} p^{k+1} q^{N-k-1} \tag{2.3}$$

and $p_{-1} = q^N$, $p_x = 0$ for x < −1. The procedural rule is as follows: 
A preliminary sample is drawn and, if it contains no defective, the 
whole consignment is accepted; if the number of defectives exceeds a, 
the whole lot is rejected. In either of these cases the process stops 
and we have no Markov chain. If, however, the number z of defectives 
lies in the range 1 ≤ z ≤ a, the sampling continues in the described 
way as long as the state of the chain is contained between 1 and a. 
Sooner or later it will pass either into 0, in which case the consignment 
is accepted, or into a + 1, in which case the consignment is rejected. 
(k) An example from genetics. Consider a population kept constant 
in size by the selection of N individuals in each successive generation. 
A particular gene assuming the forms A and a has 2N representatives; 
if in the nth generation A occurs j times, then a occurs 2N − j times. In 
this case we say that the population is at time n in state j (0 ≤ j ≤ 2N). 
Assuming random mating, the composition of the following generation 
is determined by 2N Bernoulli trials in which the A-gene has probability 
j/2N. We have therefore a Markov chain with 

$$p_{jk} = \binom{2N}{k} \left(\frac{j}{2N}\right)^{k} \left(1 - \frac{j}{2N}\right)^{2N-k}.$$

[Cf. example (8.c).] 

(l) A breeding problem. In the so-called brother-sister mating two 
individuals are mated, and among their direct descendants two individuals 
of opposite sex are selected at random. These are again mated, 
and the process continues indefinitely. With three genotypes AA, Aa, 
aa for each parent, we have to distinguish six combinations of parents 
which we label as follows: $E_1 = AA \times AA$, $E_2 = AA \times Aa$, 
$E_3 = Aa \times Aa$, $E_4 = Aa \times aa$, $E_5 = aa \times aa$, $E_6 = AA \times aa$. Using 
the rules of chapter V, it is easily seen that the matrix of transition 
probabilities is in this case 

$$P = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
\tfrac14 & \tfrac12 & \tfrac14 & 0 & 0 & 0 \\
\tfrac1{16} & \tfrac14 & \tfrac14 & \tfrac14 & \tfrac1{16} & \tfrac18 \\
0 & 0 & \tfrac14 & \tfrac12 & \tfrac14 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0
\end{pmatrix}.$$

[The discussion is continued in problem 4; a complete treatment is 
given in example XVI(4.b).] 
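
The entries of this matrix follow from the elementary rule that each parent transmits one of its two genes with probability 1/2. The Python sketch below recomputes them; it is an illustration only, with hypothetical names, and it assumes that the two selected descendants have independent genotypes and that sex is independent of genotype.

```python
from fractions import Fraction
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")
# States E_1, ..., E_6 in the order used in the text (index 0 corresponds to E_1).
STATES = [("AA", "AA"), ("AA", "Aa"), ("Aa", "Aa"), ("Aa", "aa"), ("aa", "aa"), ("AA", "aa")]

def offspring_distribution(g1, g2):
    """Genotype distribution of one descendant of parents with genotypes g1 and g2."""
    dist = {g: Fraction(0) for g in GENOTYPES}
    for x, y in product(g1, g2):             # one gene from each parent, four equally likely cases
        child = "".join(sorted(x + y))       # 'A' sorts before 'a', so "aA" becomes "Aa"
        dist[child] += Fraction(1, 4)
    return dist

def breeding_matrix():
    """Transition matrix of the brother-sister mating chain."""
    index = {frozenset(s): i for i, s in enumerate(STATES)}   # unordered pair -> state index
    P = [[Fraction(0)] * len(STATES) for _ in STATES]
    for i, (g1, g2) in enumerate(STATES):
        d = offspring_distribution(g1, g2)
        for c1, c2 in product(GENOTYPES, repeat=2):           # genotypes of brother and sister
            P[i][index[frozenset((c1, c2))]] += d[c1] * d[c2]
    return P
```

Row 3 of the result, for the parents Aa × Aa, is (1/16, 1/4, 1/4, 1/4, 1/16, 1/8), and the rows for AA × AA and aa × aa show that these two states are never left.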


3. HIGHER TRANSITION PROBABILITIES 


A transition from $E_j$ to $E_k$ in exactly n steps can occur via different 
paths $E_j \to E_{j_1} \to E_{j_2} \to \cdots \to E_{j_{n-1}} \to E_k$. The conditional 
probability that the system passes through this particular path given that 
it is at $E_j$ is $p_{jj_1} p_{j_1 j_2} \cdots p_{j_{n-1}k}$. The sum of the corresponding 
expressions for all possible paths is the probability of finding the system at 
time r + n in state $E_k$, given that at time r it was in state $E_j$. We denote 
it by $p_{jk}^{(n)}$. 

⁴ This problem was discussed at length by R. A. Fisher and S. Wright. The 
formulation in terms of Markov chains is due to G. Malécot, Sur un problème de 
probabilités en chaîne que pose la génétique, Comptes rendus de l'Académie des 
Sciences, vol. 219 (1944), pp. 379-381. 

We have, in particular, $p_{jk}^{(1)} = p_{jk}$, and 

$$p_{jk}^{(2)} = \sum_{\nu} p_{j\nu}\, p_{\nu k}. \tag{3.1}$$

By induction we find easily the recursion formula 

$$p_{jk}^{(n+1)} = \sum_{\nu} p_{j\nu}\, p_{\nu k}^{(n)}; \tag{3.2}$$

a further induction on m shows that more generally 

$$p_{jk}^{(m+n)} = \sum_{\nu} p_{j\nu}^{(m)}\, p_{\nu k}^{(n)}. \tag{3.3}$$

This equation reflects the simple fact that the first m steps lead the 
system from $E_j$ to some intermediate state $E_\nu$, and the last n steps 
from $E_\nu$ to $E_k$. The identity (3.3) is characteristic for Markov chains. 
For more general processes (cf. section 10) an analogous equation holds, 
but the last factor depends not only on $\nu$ and k but also on j. 

In the same way as the $p_{jk}$ form the matrix P, we arrange the $p_{jk}^{(n)}$ in 
a matrix to be denoted by $P^n$. Equation (3.2) states that to obtain the 
element $p_{jk}^{(n+1)}$ of $P^{n+1}$ we have to multiply the elements of the jth 
row of P by the corresponding elements of the kth column of $P^n$ and 
add all products. This operation is called row-into-column multiplication 
of the matrices P and $P^n$ and is expressed symbolically by the 
equation $P^{n+1} = PP^n$. This suggests calling $P^n$ the nth power of P; 
equation (3.3) expresses the associative law $P^{m+n} = P^m P^n$. 

In order to have (3.3) true for all n ≥ 0 we define $p_{jk}^{(0)}$ by $p_{jj}^{(0)} = 1$ 
and $p_{jk}^{(0)} = 0$ for $j \ne k$, as is natural. 
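
For a finite chain the recursion (3.2) is an ordinary matrix product. The Python sketch below is illustrative only: `mat_mul` and `mat_pow` are hypothetical names, matrices are lists of rows, and exact fractions are used so that (3.3) can be checked as an equality.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Row-into-column multiplication of two matrices given as lists of rows."""
    return [
        [sum(A[j][v] * B[v][k] for v in range(len(B))) for k in range(len(B[0]))]
        for j in range(len(A))
    ]

def mat_pow(P, n):
    """Matrix of n-step transition probabilities, built by the recursion (3.2)."""
    size = len(P)
    result = [[Fraction(int(j == k)) for k in range(size)] for j in range(size)]  # P^0
    for _ in range(n):
        result = mat_mul(P, result)      # P^{n+1} = P P^n
    return result

# An arbitrary two-state stochastic matrix, used only for illustration.
P = [[Fraction(3, 4), Fraction(1, 4)],
     [Fraction(1, 3), Fraction(2, 3)]]
```

With these definitions `mat_pow(P, m + n)` equals `mat_mul(mat_pow(P, m), mat_pow(P, n))` for all non-negative m and n, which is the content of (3.3).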


Examples. (a) In the trivial case of independent trials all rows of 
P are identical, and it is clear without calculations that $P^n = P$ for 
all n. 

(b) In the success run, example (2.h), the n-step transition 
probabilities can be written down directly. For example, in three steps the 
system can pass from $E_k$ only into $E_{k+3}$, $E_0$, $E_1$, $E_2$ and the corresponding 
probabilities are clearly $p^3$, q, qp, $qp^2$. Thus 

$$P^2 = \begin{pmatrix}
q & qp & p^2 & 0 & 0 & \cdots \\
q & qp & 0 & p^2 & 0 & \cdots \\
q & qp & 0 & 0 & p^2 & \cdots \\
\vdots & & & & &
\end{pmatrix}, \qquad
P^3 = \begin{pmatrix}
q & qp & qp^2 & p^3 & 0 & 0 & \cdots \\
q & qp & qp^2 & 0 & p^3 & 0 & \cdots \\
q & qp & qp^2 & 0 & 0 & p^3 & \cdots \\
\vdots & & & & & &
\end{pmatrix}.$$

In this case it is clear that $P^n$ converges to a matrix such that all 
elements in the column number k equal $qp^k$. 


Absolute Probabilities 


Let again $a_j$ stand for the probability of the state $E_j$ at time 0. The 
(unconditional) probability of finding the system at time n in state $E_k$ 
is then 

$$a_k^{(n)} = \sum_j a_j\, p_{jk}^{(n)}. \tag{3.4}$$

Usually we let the process start from a fixed state $E_i$, that is, we put 
$a_i = 1$. In this case $a_k^{(n)} = p_{ik}^{(n)}$. 

We feel intuitively that the influence of the initial state on the 
probability distribution at time n should gradually wear off so that for 
large n the distribution (3.4) should be nearly independent of the 
initial distribution $\{a_j\}$. This is the case if (as in the last example) 
$p_{jk}^{(n)}$ converges to a limit independent of j, that is, if $P^n$ converges to a 
matrix with identical rows. We shall see that this is usually so, but 
once more we shall have to take into account the annoying exception 
caused by periodicities. 
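
Both the wearing off of the initial state and the exception caused by periodicities can be seen on two-state chains. The Python sketch below is an illustration under assumptions: the names `step` and `absolute_probabilities` are hypothetical, and the two matrices are arbitrary examples not taken from the text.

```python
def step(a, P):
    """One application of (3.4): the distribution at time n + 1 from the one at time n."""
    return [sum(a[j] * P[j][k] for j in range(len(a))) for k in range(len(P[0]))]

def absolute_probabilities(a, P, n):
    """Distribution at time n for the initial distribution a."""
    for _ in range(n):
        a = step(a, P)
    return a

# A chain whose rows of P^n approach a common limit (here 0.8, 0.2).
P_MIXING = [[0.9, 0.1],
            [0.4, 0.6]]

# A periodic chain: the system alternates between its two states forever.
P_PERIODIC = [[0, 1],
              [1, 0]]
```

Starting `P_MIXING` from either state, the distribution after some sixty steps is (0.8, 0.2) to many decimal places. For `P_PERIODIC` the distribution started from the first state is (1, 0) at even times and (0, 1) at odd times, so no limit exists.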


4. CLOSURES AND CLOSED SETS 


We shall say that $E_k$ can be reached from $E_j$ if there exists some n ≥ 0 
such that $p_{jk}^{(n)} > 0$ (i.e., if there is a positive probability of reaching $E_k$ 
from $E_j$, including the case $E_k = E_j$). For example, in an unrestricted 
random walk each state can be reached from every other state, but 
from an absorbing barrier no other state can be reached. 


Definition. A set C of states is closed if no state outside C can be 
reached from any state $E_j$ in C. The smallest closed set containing C is 
called the closure of C. 

A single state $E_k$ forming a closed set will be called absorbing. 

A Markov chain is irreducible if there exists no closed set other than 
the set of all states. 
Clearly C is closed if, and only if, $p_{jk} = 0$ whenever j is in C and k 
outside C, for in this case we see from (3.2) that $p_{jk}^{(n)} = 0$ for every n. 
We have then the obvious 


Theorem. If in the matrices $P^n$ all rows and all columns corresponding 
to states outside the closed set C are deleted, there remain stochastic 
matrices for which the fundamental relations (3.2) and (3.3) again hold. 


This means that we have a Markov chain defined on C, and this 
subchain can be studied independently of all other states. 

The state $E_k$ is absorbing if, and only if, $p_{kk} = 1$; in this case the 
matrix of the last theorem reduces to a single element. The closure of 
a single state $E_j$ is the set of all states which can be reached from it 
(including $E_j$). This remark may be reformulated in the form of the 
following useful 


Criterion. A chain is irreducible if, and only if, every state can be 
reached from every other state. 
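
Since only the positions of the positive elements matter, closures can be found by a simple search. The Python sketch below is illustrative only: the names `closure`, `is_closed`, and `is_irreducible` are hypothetical, states are numbered from 0, and the chain is assumed finite.

```python
def closure(P, start):
    """Set of states that can be reached from `start` (including `start` itself)."""
    reached = {start}
    stack = [start]
    while stack:
        j = stack.pop()
        for k, pr in enumerate(P[j]):
            if pr > 0 and k not in reached:   # a one-step transition j -> k is possible
                reached.add(k)
                stack.append(k)
    return reached

def is_closed(P, C):
    """True if p_{jk} = 0 whenever j is in C and k is outside C."""
    return all(P[j][k] == 0 for j in C for k in range(len(P)) if k not in C)

def is_irreducible(P):
    """Criterion of the text: every state can be reached from every other state."""
    n = len(P)
    return all(len(closure(P, j)) == n for j in range(n))

# Random walk on 0..3 with absorbing barriers, example (2.b) with p = q = 1/2.
P_ABSORBING = [[1, 0, 0, 0],
               [0.5, 0, 0.5, 0],
               [0, 0.5, 0, 0.5],
               [0, 0, 0, 1]]
```

Here `closure(P_ABSORBING, 0)` is the single state 0, so that state is absorbing, whereas the closure of an interior state consists of all four states; the chain is therefore not irreducible.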


Example. In order to find all closed sets it suffices to know which 
$p_{jk}$ vanish and which are positive. Accordingly, we use a * to denote 
positive elements and consider a typical matrix, say 

$$P = \begin{pmatrix}
0 & 0 & 0 & * & 0 & 0 & 0 & 0 & * \\
0 & * & * & 0 & * & 0 & 0 & 0 & * \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & * & 0 \\
* & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & * & 0 & 0 & 0 & 0 \\
0 & * & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & * & 0 & 0 & 0 & * & * & 0 & 0 \\
0 & 0 & * & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & * & 0 & 0 & 0 & 0 & *
\end{pmatrix}.$$

In the fifth row a * appears only at the fifth place, and therefore 
$p_{55} = 1$: the state $E_5$ is absorbing. The third and the eighth row 
contain only one positive element each, and it is clear that $E_3$ and $E_8$ 
form a closed set. From $E_1$ passages are possible into $E_4$ and $E_9$, 
and from there only to $E_1$, $E_4$, $E_9$. Accordingly the three states $E_1$, 
$E_4$, $E_9$ form another closed set. 

It is now apparent that the complication of $P$ arises mainly from an
inconvenient notation. Let us relabel the states as follows:

$$
E'_1 = E_5;\quad E'_2 = E_3;\quad E'_3 = E_8;\quad E'_4 = E_1;\quad E'_5 = E_9;\quad
E'_6 = E_4;\quad E'_7 = E_2;\quad E'_8 = E_7;\quad E'_9 = E_6.
$$

The elements of the matrix $P$ are rearranged in like manner, and $P$
takes on the form

$$
P' = \begin{pmatrix}
* & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & * & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & * & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & * & * & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & * & * & 0 & 0 & 0 \\
0 & 0 & 0 & * & 0 & 0 & 0 & 0 & 0 \\
* & * & 0 & 0 & * & 0 & * & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & * & * & * \\
0 & 0 & 0 & 0 & 0 & 0 & * & 0 & 0
\end{pmatrix}.
$$

In this form the closed sets $(E'_1)$, $(E'_2, E'_3)$ and
$(E'_4, E'_5, E'_6)$ are evident. From $E'_7$ a passage is possible into
each of these three closed sets, and therefore the closure of $E'_7$ is
the set of states $E'_1, E'_2, E'_3, E'_4, E'_5, E'_6, E'_7$. From
$E'_8$ a passage is possible into $E'_7$ and $E'_9$ and hence into each
closed set: the closures of $E'_8$ and of $E'_9$ consist of all nine
states.

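Deleting all rows and all columns outside a closed set, we obtain the
three stochastic submatrices

$$
(*), \qquad
\begin{pmatrix} 0 & * \\ * & 0 \end{pmatrix}, \qquad
\begin{pmatrix} 0 & * & * \\ 0 & * & * \\ * & 0 & 0 \end{pmatrix}
\tag{4.1}
$$

and $P'$ contains no other stochastic submatrices.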
The reader is asked to find for himself the absorbing states and the 
closed sets in the matrices of the examples of section 2. 
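
Since the search for closed sets depends only on which entries are positive, it can be carried out mechanically as a reachability computation. The following Python sketch is an illustration and not part of the original argument: the names `closure` and `minimal_closed_sets` are hypothetical, states are indexed from 0 (index 0 stands for $E'_1$), and the 0/1 pattern is an assumed transcription of the relabeled matrix of the example.

```python
def closure(P, start):
    """Set of all states that can be reached from `start` (including it)."""
    reached, stack = {start}, [start]
    while stack:
        j = stack.pop()
        for k, entry in enumerate(P[j]):
            if entry and k not in reached:
                reached.add(k)
                stack.append(k)
    return reached


def minimal_closed_sets(P):
    """Closed sets in which every state can be reached from every other."""
    closures = [closure(P, j) for j in range(len(P))]
    found = []
    for cj in closures:
        # cj is minimal if each of its states has the same closure cj
        if all(closures[k] == cj for k in cj) and cj not in found:
            found.append(cj)
    return found


# Assumed pattern of the relabeled matrix: "1" marks a positive entry.
PATTERN = [
    "100000000",
    "001000000",
    "010000000",
    "000011000",
    "000011000",
    "000100000",
    "110010100",
    "000000111",
    "000000100",
]
P_PRIME = [[int(c) for c in row] for row in PATTERN]

closed = minimal_closed_sets(P_PRIME)
closure_of_seventh = closure(P_PRIME, 6)
```

With this pattern, `closed` consists of the three sets `{0}`, `{1, 2}`, `{3, 4, 5}`, and `closure_of_seventh` contains exactly the seven states 0 through 6.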


5. CLASSIFICATION OF STATES 


Consider an arbitrary, but fixed, state $E_j$ and suppose that initially
the system is in $E_j$. Every time the system passes through $E_j$ the
process recommences from scratch exactly as it has begun. It is
therefore clear that the return to $E_j$ is a recurrent event as defined
in chapter XIII. If the system starts from another state $E_i$, then the
passage through $E_j$ becomes a delayed recurrent event as defined in
chapter XIII, section 5. It should therefore be clear that Markov chains
are but a special case of recurrent events; the only new feature is that
we are dealing with many recurrent events simultaneously.

Each state $E_j$ is characterized by its recurrence time distribution
$\{f_j^{(n)}\}$. Here $f_j^{(n)}$ is the probability that the first
return to $E_j$ occurs at time $n$. Starting from the $p_{jj}^{(n)}$, we
can calculate the $f_j^{(n)}$ using the obvious recurrence relations⁵

$$
\begin{aligned}
f_j^{(1)} &= p_{jj}, \qquad f_j^{(2)} = p_{jj}^{(2)} - f_j^{(1)} p_{jj}, \qquad \ldots, \\
f_j^{(n)} &= p_{jj}^{(n)} - f_j^{(1)} p_{jj}^{(n-1)} - f_j^{(2)} p_{jj}^{(n-2)} - \cdots - f_j^{(n-1)} p_{jj}
\end{aligned}
\tag{5.1}
$$

which, of course, are only a special case of the basic relation
XIII(3.1) for recurrent events. The sum

$$
f_j = \sum_{n=1}^{\infty} f_j^{(n)} \tag{5.2}
$$

is the probability that, starting from $E_j$, the system ever returns to
$E_j$. The state $E_j$ is persistent if $f_j = 1$; in this case the mean
recurrence time is

$$
\mu_j = \sum_{n=1}^{\infty} n f_j^{(n)}. \tag{5.3}
$$

We shall call $E_j$ a null state if $\mu_j = \infty$.
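
The recurrence (5.1) translates directly into a short computation. The Python sketch below is only an illustration under stated assumptions: the helper names `mat_mul` and `first_return_probs` are hypothetical, and the two-state chain with all transition probabilities equal to 1/2 is an invented example, chosen because its first-return probabilities are $f_j^{(n)} = 2^{-n}$ and can be checked by hand.

```python
from fractions import Fraction


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def first_return_probs(P, j, n_max):
    """Return [f_j^(1), ..., f_j^(n_max)] computed from relation (5.1)."""
    p = [Fraction(1)]                 # p[n] = p_jj^(n), with p_jj^(0) = 1
    Pn = P
    for _ in range(n_max):
        p.append(Pn[j][j])
        Pn = mat_mul(Pn, P)
    f = [Fraction(0)]                 # f[n] = f_j^(n); f[0] is a placeholder
    for n in range(1, n_max + 1):
        f.append(p[n] - sum(f[v] * p[n - v] for v in range(1, n)))
    return f[1:]


half = Fraction(1, 2)
P = [[half, half],
     [half, half]]                    # hypothetical two-state chain
f = first_return_probs(P, 0, 40)
f_total = sum(f)                                       # partial sum of (5.2)
mu = sum(n * fn for n, fn in enumerate(f, start=1))    # partial sum of (5.3)
```

For this chain the partial sums approach $f_j = 1$ and $\mu_j = 2$, so the state is persistent and not a null state. Exact fractions are used so that the subtraction in (5.1) introduces no rounding error.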
If the system starts at $E_i$, the waiting time up to the first passage
through $E_j$ has a distribution $\{f_{ij}^{(n)}\}$ where

$$
f_{ij}^{(1)} = p_{ij}, \qquad
f_{ij}^{(n)} = p_{ij}^{(n)} - \sum_{\nu=1}^{n-1} f_{ij}^{(\nu)} p_{jj}^{(n-\nu)}. \tag{5.4}
$$

Again this equation is not specific to Markov chains but is valid for
arbitrary delayed recurrent events. Of course, if $E_j$ cannot be reached
from $E_i$, then $f_{ij}^{(n)} = 0$ for all $n$. In general,

$$
f_{ij} = \sum_{n=1}^{\infty} f_{ij}^{(n)} \tag{5.5}
$$

is the probability that, starting from $E_i$, the system ever reaches $E_j$.
We can now summarize the basic facts proved in chapter XIII,
sections 3 and 5, as follows:

(i) A state $E_j$ is transient if $f_j < 1$. Necessary and sufficient
for this is the condition that $\sum_{n=1}^{\infty} p_{jj}^{(n)} < \infty$.
In this case automatically⁶ $\sum_{n=1}^{\infty} p_{ij}^{(n)} < \infty$
for each $i$.

(ii) A state $E_j$ is a persistent null state if $f_j = 1$, but the mean
recurrence time $\mu_j = \infty$. Necessary and sufficient for this is
the condition that $\sum_{n=1}^{\infty} p_{jj}^{(n)} = \infty$ but
$p_{jj}^{(n)} \to 0$. In this case⁶

$$
p_{ij}^{(n)} \to 0 \quad \text{as } n \to \infty \tag{5.6}
$$

for each $i$.

(iii) The state $E_j$ has period $t > 1$ if $p_{jj}^{(n)} = 0$ whenever
$n$ is not divisible by $t$ and $t$ is the largest integer with this
property (i.e., a return to $E_j$ is impossible except, perhaps, in
$t, 2t, 3t, \ldots$ steps).

(iv) If $E_j$ is persistent and aperiodic (not periodic), then⁶

$$
p_{ij}^{(n)} \to \mu_j^{-1} f_{ij} \quad \text{as } n \to \infty \tag{5.7}
$$

and, in particular,

$$
p_{jj}^{(n)} \to \mu_j^{-1} \quad \text{as } n \to \infty. \tag{5.8}
$$

(If $E_j$ is a null state, set $\mu_j^{-1} = 0$.)

(v) If $E_j$ is persistent and has period $t$, then (5.8) is to be
replaced by

$$
p_{jj}^{(nt)} \to t \mu_j^{-1} \quad \text{as } n \to \infty. \tag{5.9}
$$

Persistent states which are neither periodic nor null states will be
called ergodic.⁷


⁵ They state that the probability of a first return to $E_j$ at time $n$
equals the probability of a return at time $n$ minus the probability that
the first return takes place at some time $\nu = 1, 2, \ldots, n-1$ and is
followed by a repeated return at time $n$. In the notation of XIII(3.1),
we have $p_{jj}^{(n)} = u_n$ and $f_j^{(n)} = f_n$.


⁶ This follows trivially from (5.4) but is really a special case of the
theorem in chapter XIII, section 5.

⁷ Unfortunately, no generally accepted terminology exists. In the first
edition the persistent states were called recurrent, which causes
confusion by obscuring the parallelism between Markov chains and
recurrent events. Kolmogorov calls transient states unessential, but new
research has shown that the main interest, both theoretical and
practical, centers on transient states. The term ergodic, being
synonymous with "persistent, non-null, non-periodic," is rather generally
accepted, but "positive" state is one of the existing alternatives, and
sometimes "ergodic" is equated to persistent.


Examples. (a) Consider the matrix $P'$ of the example in section 4
(omitting the dashes). The state $E_1$, being absorbing, is persistent.
From $E_2$ the system necessarily passes into $E_3$ and from there back
into $E_2$. Therefore $E_2$ and $E_3$ are persistent states, with period 2
and mean recurrence time 2. The states $E_4$, $E_5$, $E_6$ form a closed
subset, and the transitions between them are regulated by the last matrix
shown in (4.1). It is clear that these states are persistent and
non-periodic. (We shall see later on that in a finite Markov chain no
persistent null states are possible.)

From $E_7$ a passage into one of these closed sets is possible, and then
the system stays in that closed set forever. Therefore $E_7$ is transient.
From $E_9$ the system passes into $E_7$ and no return to $E_9$ is possible;
therefore $E_9$ too is transient. Finally, starting from $E_8$, the system
will sooner or later pass into $E_7$ or $E_9$, never to return. Accordingly
$E_7$, $E_8$, $E_9$ are transient.

(b) We recall that in an unrestricted random walk [example (2.e)]
all states are persistent if $p = q$ and are transient otherwise [see
example (8.d)].
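
For a finite chain the classification in example (a) can be checked mechanically. The sketch below rests on two assumptions that should be read as such: it uses the fact that in a finite chain a state is persistent exactly when every state reachable from it leads back to it, and it estimates the period as the greatest common divisor of the return times observed up to a heuristic bound `n_max`. The function names and the 0/1 pattern (an assumed transcription of $P'$, with index 0 standing for $E_1$) are hypothetical.

```python
from math import gcd


def reachable(P, j):
    """Set of states that can be reached from state j (including j)."""
    seen, stack = {j}, [j]
    while stack:
        a = stack.pop()
        for b, entry in enumerate(P[a]):
            if entry > 0 and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def classify(P):
    """Label each state of a finite chain 'persistent' or 'transient'."""
    reach = [reachable(P, j) for j in range(len(P))]
    return ["persistent" if all(j in reach[k] for k in reach[j])
            else "transient" for j in range(len(P))]


def period(P, j, n_max=None):
    """gcd of all n <= n_max with p_jj^(n) > 0 (0 if no return is seen)."""
    n_max = n_max or 2 * len(P) ** 2          # heuristic search bound
    current, t = {j}, 0
    for n in range(1, n_max + 1):
        # states that can be occupied after exactly n steps
        current = {b for a in current for b, e in enumerate(P[a]) if e > 0}
        if j in current:
            t = gcd(t, n)
    return t


PATTERN = ["100000000", "001000000", "010000000",
           "000011000", "000011000", "000100000",
           "110010100", "000000111", "000000100"]
P_PRIME = [[int(c) for c in row] for row in PATTERN]

labels = classify(P_PRIME)
periods = [period(P_PRIME, j) for j in range(6)]
```

Here `labels` marks the first six states as persistent and the last three as transient, and `periods` equals `[1, 2, 2, 1, 1, 1]`, in agreement with the discussion of example (a).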

It is not always easy to decide whether or not a given state is
persistent, and the criterion that $\sum p_{jj}^{(n)}$ should diverge is
usually too difficult to apply. A better criterion is contained in the
theorem of the next section.

Let $E_j$ be a fixed persistent state and $E_k$ some other state which
can be reached from it. Furthermore, let $N$ be the length of the shortest
possible path from $E_j$ to $E_k$, and put $p_{jk}^{(N)} = \alpha > 0$.
A return from $E_k$ to $E_j$ must have positive probability, for otherwise
the probability of the system's not returning to $E_j$ would be at least
$\alpha$, and $f_j \le 1 - \alpha < 1$ contrary to the assumption that
$E_j$ is persistent. It follows that there exists an index $M$ such that
$p_{kj}^{(M)} = \beta > 0$. Now for any $n$ we have obviously

$$
p_{jj}^{(n+N+M)} \ge p_{jk}^{(N)} p_{kk}^{(n)} p_{kj}^{(M)} = \alpha\beta\, p_{kk}^{(n)} \tag{5.10}
$$

and

$$
p_{kk}^{(n+N+M)} \ge p_{kj}^{(M)} p_{jj}^{(n)} p_{jk}^{(N)} = \alpha\beta\, p_{jj}^{(n)}. \tag{5.11}
$$

These relations imply that the sequences $p_{jj}^{(n)}$ and
$p_{kk}^{(n)}$ have the same asymptotic behavior, and from this we can
draw important conclusions. To begin with, $E_j$ was assumed persistent,
and therefore the series $\sum p_{jj}^{(n)}$ diverges. From (5.11) it
follows that also $\sum p_{kk}^{(n)}$ diverges, so that $E_k$ must be
persistent. If $p_{jj}^{(n)} \to 0$, then also $p_{kk}^{(n)} \to 0$, and
vice versa. Finally, suppose that $E_j$ has period $t$. Since a return to
$E_j$ is possible in $N + M$ steps, $N + M$ must be a multiple of $t$. It
follows then from (5.10) and (5.11) that $E_j$ and $E_k$ must have the
same period.

We see thus that from a persistent state only persistent states can be
reached, and they are all of the same type: Either they are all null
states, or all ergodic, or all periodic non-null states with the same
period. The closure $C$ of a persistent state $E_j$ is an irreducible
set, and its submatrix defines a Markov chain on it which can be treated
independently of the rest. We have thus proved the important


Theorem. In an irreducible Markov chain all states belong to the
same class: they are all transient, all persistent null states, or all
persistent non-null states. In every case they have the same period.
Moreover, every state can be reached from every other state.

In every chain the persistent states can, in a unique manner, be divided
into closed sets $C_1, C_2, \ldots$ such that from any state of a given
set $C_\nu$ all states of that set and no other can be reached. All states
belonging to the same closed set $C_\nu$ are necessarily of the same class.

In addition to the closed sets $C_\nu$ the chain will in general contain
transient states from which states of the closed sets $C_\nu$ can be
reached (but not vice versa).


This theorem has the interesting 


Corollary. In a finite Markov chain there exist no null states, and it
is impossible that all states are transient.


Proof. It suffices to consider irreducible chains. If all states were
either transient or null states, we would have $p_{jk}^{(n)} \to 0$ as
$n \to \infty$ for each fixed pair $j, k$. Each row of $P^n$ would tend to
zero while the row sums equal unity. This is clearly impossible in the
case of finitely many terms, and we conclude that in a finite irreducible
chain there exist neither transient nor null states.


It follows that after an appropriate renumbering of the states (such
as was used in the example of section 4) the matrix $P$ corresponding to
a chain with, say, two closed sets $C_1$ and $C_2$ and additional
transient states can be written schematically in the form of a
partitioned matrix

$$
P = \begin{pmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ A & B & C \end{pmatrix} \tag{5.12}
$$

where $P_1$ and $P_2$ are the matrices of transition probabilities within
the two closed sets. The matrix $P^n$ is then of the same type with
$P_1$, $P_2$, $C$ replaced by $P_1^n$, $P_2^n$, $C^n$ (and $A$ and $B$ by
more complicated matrices to be studied in section 8). Note that $P_1$,
$P_2$, and $C$ are square matrices, but $A$ and $B$ may be rectangular
matrices as in the example.


6. ERGODIC PROPERTIES OF IRREDUCIBLE CHAINS


In this section we restrict the discussion to aperiodic chains; as with 
recurrent events in general, the modifications required for periodic 
chains are rather trite, but the formulations become unpleasantly in- 
volved. 


Definition. A probability distribution $\{v_k\}$ is called stationary if

$$
v_j = \sum_i v_i p_{ij}. \tag{6.1}
$$

If the initial distribution $\{a_k\}$ happens to be stationary, then the
absolute probabilities $\{a_k^{(n)}\}$ are independent of the time $n$,
that is, $a_k^{(n)} = a_k$. The physical significance of stationarity
becomes apparent if we imagine a large number of processes going on
simultaneously. Let, for example, $N$ particles perform independently the
same type of random walk. At time $n$ the expected number of particles in
state $E_k$ is $N a_k^{(n)}$. With a stationary distribution these
expected numbers remain constant, and we observe (if $N$ is large so that
the law of large
numbers applies) a state of macroscopic equilibrium maintained by a 
large number of transitions in opposite directions. Most statistical 
equilibria in physics are of this kind; that is, they are due exclusively 
to the simultaneous observation of many independent particles. Typi- 
cal is the case of a symmetric random walk (or diffusion): if many par- 
ticles are observed, then, after a sufficiently long time, roughly half of 
them will be to the right, the other to the left of the origin. Neverthe- 
less, we know from the arc sine law of chapter III, section 5, that the 
majority of the particles individually will misbehave and spend a dispro- 
portionately large part of the time on the same side of the origin. 
Many protracted discussions and erroneous conclusions could be 
avoided by the realization that the notion of statistical equilibrium (or 
the steady state) does not say anything concerning the behavior of the 
individual particle. This should be borne in mind in connection with 
the next theorem which is frequently described as asserting a "tendency
toward equilibrium."


Theorem. An irreducible aperiodic Markov chain belongs to one of
the following two classes:

(a) Either the states are all transient or all null states; in this case
$p_{jk}^{(n)} \to 0$ as $n \to \infty$ for each pair $j, k$ and there
exists no stationary distribution.

(b) Or else, all states are ergodic, that is

$$
\lim_{n \to \infty} p_{jk}^{(n)} = u_k > 0 \tag{6.2}
$$

where $u_k$ is the reciprocal of the mean recurrence time of $E_k$. In
this case $\{u_k\}$ is a stationary distribution and there exists no other
stationary distribution.


A slight reformulation may explain the implications of this theorem.
If (6.2) holds, then for an arbitrary initial distribution $\{a_k\}$

$$
a_k^{(n)} = \sum_j a_j p_{jk}^{(n)} \to u_k. \tag{6.3}
$$

Therefore: If there exists a stationary distribution, it is necessarily
unique and the distribution at time $n$ tends to it irrespective of the
initial distribution. The only alternative to this situation is that
$p_{jk}^{(n)} \to 0$.
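
The statement that the distribution at time $n$ tends to the stationary one irrespective of the start can be observed numerically. The following Python sketch is purely illustrative: the three-state matrix is a hypothetical irreducible aperiodic chain, and `step` and `distribution_after` are assumed helper names implementing the sum in (6.3) one step at a time.

```python
def step(a, P):
    """One application of the transition matrix: returns the row vector a P."""
    size = len(P)
    return [sum(a[j] * P[j][k] for j in range(size)) for k in range(size)]


def distribution_after(a, P, n):
    """Distribution at time n when the initial distribution is a."""
    for _ in range(n):
        a = step(a, P)
    return a


# Hypothetical irreducible aperiodic chain on three states.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

from_first = distribution_after([1.0, 0.0, 0.0], P, 200)
from_last = distribution_after([0.0, 0.0, 1.0], P, 200)
u = [0.25, 0.50, 0.25]          # candidate solution of (6.1) with sum 1
u_next = step(u, P)             # equals u if u is stationary
```

Both starting points lead to the same limit `u`, and `u_next` reproduces `u`, so that the mean recurrence times of the three states would be 4, 2, and 4 by (6.2).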


Proof. The preceding theorem assures us that (6.2) holds whenever
the states are ergodic. To prove assertion (b) above we first note that

$$
\sum_k u_k \le 1. \tag{6.4}
$$

This follows directly from the fact that for fixed $j$ and $n$ the
quantities $p_{jk}^{(n)}$ ($k = 1, 2, \ldots$) add to unity, so that
$u_1 + u_2 + \cdots + u_N \le 1$ for every $N$. Now put $n = 1$ in (3.3)
and let $m \to \infty$. The left side tends to $u_k$, and the general term
of the sum on the right side tends to $u_\nu p_{\nu k}$. Adding an
arbitrary finite number of terms, we see that

$$
u_k \ge \sum_\nu u_\nu p_{\nu k}. \tag{6.5}
$$

Summing these inequalities over all $k$, we obtain the finite quantity
$\sum u_k$ on each side. This shows that in (6.5) the strict inequality is
impossible and therefore

$$
u_k = \sum_\nu u_\nu p_{\nu k}. \tag{6.6}
$$

Putting $v_k = u_k \cdot (\sum u_j)^{-1}$ we see that $\{v_k\}$ is a
stationary distribution and hence at least one such distribution exists.

Let $\{v_k\}$ be any distribution satisfying equations (6.1). Multiplying
(6.1) by $p_{jr}^{(n)}$ and adding over $j$ we see by induction that for
each $n$

$$
v_r = \sum_\nu v_\nu p_{\nu r}^{(n)}. \tag{6.7}
$$

Letting $n \to \infty$ we get

$$
v_r = (v_1 + v_2 + \cdots)\, u_r = u_r. \tag{6.8}
$$


This completes the proof of assertion (b). If the states are transient
or null states and $\{v_k\}$ is a stationary distribution, then equations
(6.7) hold and $p_{\nu r}^{(n)} \to 0$, which is clearly impossible.
Accordingly, a stationary distribution can exist only in the ergodic
case, and the proof is completed.

Examples. (a) The Ehrenfest model. In example (2.f), the conditions
(6.1) for a stationary distribution take on the form

$$
v_k = \left(1 - \frac{k-1}{a}\right) v_{k-1} + \frac{k+1}{a}\, v_{k+1}
\qquad (k = 1, \ldots, a-1). \tag{6.9}
$$

It is easily verified that the solution is given by the binomial
distribution

$$
v_k = \binom{a}{k} 2^{-a}. \tag{6.10}
$$

This result can be interpreted as follows: Whatever the initial number 
of molecules in the first container, after a long time the probability of 
finding k molecules in it is nearly the same as if the a molecules had 
been distributed at random, each molecule having probability 1/2 to be
in the first container. This is a typical example of how our result gains 
physical significance. 

For large a the normal approximation to the binomial distribution 
shows that, once the limiting distribution (6.10) is established, we are 
practically certain to find about one-half the molecules in each
container. To the physicist $a = 10^6$ is a small number, indeed. But even
with $a = 10^6$ molecules the probability of finding more than 505,000
molecules in one container (density fluctuation of about 1 per cent) is
of the order of magnitude $10^{-23}$. With $a = 10^8$ a density fluctuation
of one in a thousand has the same negligible probability. It is true
that the system will occasionally pass into very improbable states, but 
their recurrence times are fantastically large as compared to the recur- 
rence times of states near the equilibrium. Physical irreversibility 
manifests itself in the fact that, whenever the system is in a state far 
removed from equilibrium, it is much more likely to move toward equi- 
librium than in the opposite direction. 
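
Both claims of this example can be checked with a few lines of Python. The sketch is illustrative only and makes the following assumptions: the transition probabilities $p_{k,k-1} = k/a$ and $p_{k,k+1} = 1 - k/a$ are read off from the stationarity conditions of the example, `ehrenfest_matrix` is a hypothetical helper, the small value `a = 10` is chosen merely to keep the exact check fast, and the tail probability is evaluated by the normal approximation mentioned in the text.

```python
from fractions import Fraction
from math import comb, erfc, log10, sqrt


def ehrenfest_matrix(a):
    """Matrix on states 0..a with p(k, k-1) = k/a and p(k, k+1) = 1 - k/a."""
    P = [[Fraction(0)] * (a + 1) for _ in range(a + 1)]
    for k in range(a + 1):
        if k > 0:
            P[k][k - 1] = Fraction(k, a)
        if k < a:
            P[k][k + 1] = 1 - Fraction(k, a)
    return P


a = 10
P = ehrenfest_matrix(a)
v = [Fraction(comb(a, k), 2 ** a) for k in range(a + 1)]     # binomial (6.10)
vP = [sum(v[i] * P[i][k] for i in range(a + 1)) for k in range(a + 1)]

# For 10**6 molecules the standard deviation is sqrt(10**6) / 2 = 500, so
# 505,000 lies 10 standard deviations above the mean 500,000.
tail = 0.5 * erfc(10 / sqrt(2))          # normal approximation to the tail
order = log10(tail)
```

The exact arithmetic gives `vP == v`, confirming that the binomial distribution is stationary, and `order` lies between -24 and -23, consistent with the stated order of magnitude.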

(b) Doubly stochastic matrices. The matrix $P$ is called doubly
stochastic if not only the row sums but also the column sums are unity.
Suppose that the chain contains only a finite number, $a$, of states.
The system (6.1) has then obviously the solution $v_k = 1/a$. It follows
that, if a finite irreducible aperiodic chain has a doubly stochastic
matrix $P$, then $u_k = 1/a$ (i.e., in the limit all states become equally
probable). In this case no transient states are possible. By contrast, in
an irreducible infinite chain with doubly stochastic matrix all states are
either transient or null states. To prove this assertion suppose that
(6.2) holds. The matrix $P$ being doubly stochastic, we have for each
fixed $k$ and arbitrarily large $N$

$$
1 = \sum_{j=1}^{\infty} p_{jk}^{(n)} \ge \sum_{j=1}^{N} p_{jk}^{(n)} \to N u_k \tag{6.11}
$$

and this clearly implies $u_k = 0$ against the assumption. (This proof
applies also in the periodic case.)

(c) Recurrent events. In example (2.7) we have introduced a Markov
chain associated with an arbitrary recurrent event $\mathcal{E}$, and we
proceed now to show that (as could be expected) the states of the chain
are always of the same type as $\mathcal{E}$.

First consider the case of a transient $\mathcal{E}$; that is, suppose
$f < 1$. The chain of transitions
$E_j \to E_{j+1} \to E_{j+2} \to \cdots \to E_{j+n}$ has probability

$$
\frac{1 - s_{j+1}}{1 - s_j} \cdot \frac{1 - s_{j+2}}{1 - s_{j+1}} \cdots
\frac{1 - s_{j+n}}{1 - s_{j+n-1}} = \frac{1 - s_{j+n}}{1 - s_j}
\to \frac{1 - f}{1 - s_j}. \tag{6.12}
$$

The probability that the system will never enter $E_0$ is thus seen to be
positive, and therefore all states are transient. On the other hand,
when $f = 1$, the left-hand side of (6.12) tends to zero; with probability
one the system will sooner or later pass through $E_0$. It follows that
$E_0$ is persistent, and since every state can be reached from $E_0$, the
chain is irreducible. We see thus: If $\mathcal{E}$ is transient, so are
all states of the chain; if $\mathcal{E}$ is persistent, then the chain is
irreducible and all states are persistent.

It is clear that the chain and $\mathcal{E}$ have the same period, and we
shall suppose that $\mathcal{E}$ is aperiodic and persistent. We have to
decide whether there exists a stationary distribution, that is, a
probability distribution $\{v_k\}$ satisfying (6.1). In the present case
(6.1) reduces to

$$
v_0 = \sum_{j=0}^{\infty} \frac{f_{j+1}}{1 - s_j}\, v_j, \qquad
v_k = \frac{1 - s_k}{1 - s_{k-1}}\, v_{k-1}. \tag{6.13}
$$

There exists a unique solution of these equations, namely

$$
v_k = (1 - s_k)\, v_0 = r_k v_0 \qquad \text{where} \qquad
r_k = \sum_{n=k+1}^{\infty} f_n. \tag{6.14}
$$

In order that $\sum v_k < \infty$ it is necessary and sufficient that
$\sum r_k < \infty$. But $\sum r_k = \sum n f_n$ equals the mean recurrence
time [cf. XI(1.8)]. This shows that the states of the Markov chain are
null states if the mean recurrence time is infinite, and they have finite
mean recurrence times if $\mathcal{E}$ has.

We have derived the asymptotic properties of Markov chains from
similar properties of recurrent events. Now we have shown that each 
recurrent event may be described in terms of a particular Markov 
chain. The three topics, asymptotic behavior of Markov chains, asymp- 
totic behavior of recurrent events (i.e., summation theory of integral- 
valued independent random variables), and renewal theory, are there- 
fore different versions of the same analytic background and are sub- 
stantially equivalent. 
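
The correspondence between a recurrent event and its associated chain can be made concrete for a recurrence time distribution with finitely many terms. The Python sketch below is an illustration under explicit assumptions: the transition probabilities (from $E_k$ to $E_{k+1}$ with probability $(1 - s_{k+1})/(1 - s_k)$ and otherwise back to $E_0$, with $s_k = f_1 + \cdots + f_k$) are inferred from (6.12) to (6.14), the function name `age_chain` is hypothetical, and the distribution $f_1 = 1/2$, $f_2 = f_3 = 1/4$ is an invented example.

```python
from fractions import Fraction


def age_chain(f):
    """Chain for a persistent recurrent event with f[n-1] = f_n, sum(f) = 1.

    From E_k the system moves to E_{k+1} with probability
    (1 - s_{k+1}) / (1 - s_k) and returns to E_0 otherwise.
    """
    m = len(f)
    s = [sum(f[:k], Fraction(0)) for k in range(m + 1)]   # s_k = f_1+...+f_k
    P = [[Fraction(0)] * m for _ in range(m)]
    for k in range(m):
        P[k][0] = f[k] / (1 - s[k])                       # return to E_0
        if k + 1 < m:
            P[k][k + 1] = (1 - s[k + 1]) / (1 - s[k])
    return P, s


f = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]      # hypothetical f_1..f_3
P, s = age_chain(f)
r = [1 - s[k] for k in range(len(f))]                     # r_k of (6.14)
mean = sum(n * fn for n, fn in enumerate(f, start=1))     # mean recurrence time
v = [rk / sum(r) for rk in r]                             # normalized (6.14)
vP = [sum(v[i] * P[i][k] for i in range(len(f))) for k in range(len(f))]
```

For this example `sum(r)` equals the mean recurrence time 7/4, `vP == v` confirms stationarity, and `v[0]` equals 4/7, the reciprocal of the mean recurrence time.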


In concluding this section, let us remark that it is usually compara- 
tively easy to decide whether a stationary distribution exists and hence 
whether a given irreducible chain is ergodic. In section 8 we shall de- 
rive a similar criterion to discriminate between transient states and 
persistent null states. In principle this could be decided by discussing 
the convergence of the series $\sum p_{jj}^{(n)}$, but in practice this question cannot
be attacked directly. 


*7. PERIODIC CHAINS


In the preceding section we have excluded the case of periodic chains, 
but this was done only to avoid obscuring salient facts by complicated 
descriptions. A characterization of the asymptotic behavior of $p_{jk}^{(n)}$ in
irreducible periodic chains can be derived easily from the theorems of 
the preceding sections. We give such a derivation for the sake of com- 
pleteness, but the results of this section will not be used in the sequel. 

By the theorem of section 5 all states of an irreducible chain have the
same period $t$. Consider any two states $E_j$ and $E_k$ of an irreducible
chain with period $t$. Since every state can be reached from every
other, there exist integers $a$, $b$ such that $p_{jk}^{(a)} > 0$ and
$p_{kj}^{(b)} > 0$. Now $p_{jj}^{(a+b)} \ge p_{jk}^{(a)} p_{kj}^{(b)}$,
which shows that a return to $E_j$ in $a + b$ steps is possible, so that
$a + b$ is necessarily divisible by the period $t$. It follows that, if
$E_k$ can be reached from $E_j$ in $a_1$ and in $a_2$ steps, $a_2 - a_1$
must be divisible by $t$, and hence a division of $a_1$ and $a_2$ by $t$
will leave the same remainder.

Accordingly, for fixed $E_j$ each state $E_k$ belongs to a certain
remainder $\nu$ (where $0 \le \nu \le t - 1$) such that a transition from
$E_j$ to $E_k$ is possible only in $\nu, \nu + t, \nu + 2t, \nu + 3t, \ldots$
steps. Choosing $j = 1$, we get a classification of all states into $t$
groups $G_0, G_1, \ldots, G_{t-1}$ so that $E_k$ belongs to $G_\nu$ if
$p_{1k}^{(a)} > 0$ implies that $a = \nu + nt$. We order the $G_\nu$
cyclically so that $G_0$ and $G_{t-1}$ become neighbors.

* This section treats a special topic and should be omitted at first reading.

It follows in particular that a one-step transition from a state in
$G_\nu$ will always lead to a state in the next following group
$G_{\nu+1}$ (or $G_0$ in case $\nu = t - 1$); a two-step transition will
lead to a state in $G_{\nu+2}$ (from $G_{t-2}$ it leads to $G_0$, from
$G_{t-1}$ to $G_1$), etc. Finally, a $t$-step transition leads necessarily
to a state belonging to the same group. This means that, in a Markov
chain whose matrix of transition probabilities is $P^t$, each group
$G_\nu$ forms a closed set. Since the original chain is irreducible, each
state can be reached from every other. This implies that in the chain
with transition probabilities $P^t$ each $G_\nu$ forms an irreducible
closed set. We have thus the


Theorem. In an irreducible periodic Markov chain the states can be
divided into $t$ groups $G_0, \ldots, G_{t-1}$, so that a one-step
transition from a state of $G_\nu$ always leads to a state of $G_{\nu+1}$
(to $G_0$ if $\nu = t - 1$). If we consider the chain only at times
$t, 2t, 3t, \ldots$, then we get a new chain whose matrix of transition
probabilities is $P^t$. In it each $G_\nu$ forms an irreducible closed set.


Our theorem contains complete information concerning the asymptotic
behavior of $p_{jk}^{(n)}$. If all states are transient or null states,
then $p_{jk}^{(n)} \to 0$ for every pair $j, k$. Otherwise each state $E_k$
has a finite mean recurrence time $\mu_k$. Suppose that $E_j$ belongs to
$G_\nu$. On $G_\nu$ we have an irreducible non-periodic Markov chain with
transition probabilities $p_{jk}^{(t)}$, and hence there exist the limits

$$
\lim_{n \to \infty} p_{jk}^{(nt)} =
\begin{cases} u_k & \text{if } E_k \text{ is in } G_\nu, \\ 0 & \text{otherwise,} \end{cases}
\tag{7.1}
$$

where $u_k$ is the reciprocal mean recurrence time of $E_k$ in the new
chain, one step of which corresponds to $t$ steps of the original chain.
Thus

$$
u_k = \frac{t}{\mu_k}. \tag{7.2}
$$

Using (3.2), we find from (7.1),

$$
\lim_{n \to \infty} p_{jk}^{(nt+1)} =
\begin{cases} u_k & \text{if } E_k \text{ is in } G_{\nu+1}, \\ 0 & \text{otherwise.} \end{cases}
\tag{7.3}
$$

Similarly, $p_{jk}^{(nt+2)} \to u_k$ if $E_k$ is in $G_{\nu+2}$, etc. In
other words, for fixed $E_j$ and $E_k$ the sequence $p_{jk}^{(n)}$ is
asymptotically periodic; in it blocks of $t - 1$ consecutive zeros
alternate with a positive element which converges to $u_k = t/\mu_k$.

By the theorem of section 6, the $u_k$ within each group $G_\nu$ add to
unity. Since there are $t$ blocks, it follows from (7.2) that the sequence
$\{1/\mu_k\}$ represents a probability distribution. The argument of
section 6 shows directly that this distribution is stationary and that no
other stationary distributions exist.
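
The division of the states into the cyclic groups $G_\nu$ can be computed from the pattern of positive entries alone. The sketch below is an illustration, not the method of the text: it uses a breadth-first search from state 0 and takes the period as the greatest common divisor of `dist[a] + 1 - dist[b]` over all possible one-step transitions $a \to b$, a standard graph-theoretic device that is assumed here. The function name `cyclic_groups` and the four-state cyclical random walk are hypothetical.

```python
from collections import deque
from math import gcd


def cyclic_groups(P):
    """Return (t, groups) for an irreducible chain: period t and G_0..G_{t-1}."""
    n = len(P)
    dist = [None] * n           # number of steps of one path from state 0
    dist[0] = 0
    queue, t = deque([0]), 0
    while queue:
        a = queue.popleft()
        for b in range(n):
            if P[a][b] > 0:
                if dist[b] is None:
                    dist[b] = dist[a] + 1
                    queue.append(b)
                # two path lengths from state 0 to b must agree modulo t
                t = gcd(t, dist[a] + 1 - dist[b])
    groups = [[] for _ in range(t)]
    for k in range(n):
        groups[dist[k] % t].append(k)
    return t, groups


# Hypothetical cyclical random walk on four states (steps of +1 or -1).
P = [[0.0, 0.5, 0.0, 0.5],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5],
     [0.5, 0.0, 0.5, 0.0]]

t, groups = cyclic_groups(P)
```

For this walk `t` is 2 and `groups` is `[[0, 2], [1, 3]]`: every one-step transition leads from one group into the other, and two-step transitions stay within a group.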


8. TRANSIENT STATES 


From a persistent state $E_j$ the system can pass only into a persistent
$E_k$ in the closure of $E_j$, and we have obtained complete information
concerning the asymptotic behavior of $p_{jk}^{(n)}$ in this case.

If $E_j$ is transient and $E_k$ ergodic, then by (5.7)

$$
p_{jk}^{(n)} \to \mu_k^{-1} f_{jk} \tag{8.1}
$$

where $\mu_k$ is the mean recurrence time of $E_k$ and $f_{jk}$ the
probability that, starting from $E_j$, the system will sooner or later
enter $E_k$. However, $E_k$ belongs to an irreducible subchain $C$, and
from $E_k$ the system is bound to pass through each state of $C$.
Therefore, for each fixed $j$ the probability $f_{jk}$ is the same for all
states of $C$. In other words, if $C$ is an irreducible subchain with
ergodic states and $E_j$ is transient, then for each $E_k$ of $C$

$$
p_{jk}^{(n)} \to \mu_k^{-1} x_j \tag{8.2}
$$

where $x_j$ is the probability that, starting from $E_j$, the system will
ever enter $C$. Needless to say that for null states the right-hand side
in (8.2) must be replaced by 0, and that the case of periodic $E_k$
necessitates only the usual routine modification.

To complete the picture of the asymptotic behavior of Markov 
chains, it remains to solve the following 


Three Problems. (a) Given a transient state $E_j$ and a persistent
closed set $C$, find the probability $x_j$ that, starting from $E_j$, the
system will ever enter $C$ (i.e., pass through a state of $C$).

(b) Find the probability $y_j$ that the system will forever remain in the
set of transient states.

(c) Given an irreducible chain, decide whether its states are transient or
persistent.

It will be seen presently that, after a slight reformulation, problem
(c) becomes a special case of (a).

Let $T$ be the set of all transient states and suppose that the system
is initially in the transient state $E_j$; let $x_j^{(n)}$ be the
probability that at time $n$, and not sooner, the system reaches the
closed set $C$. Then

$$
x_j = \sum_{n=1}^{\infty} x_j^{(n)} \tag{8.3}
$$

is the probability that the system will ultimately reach $C$ and stay in
$C$. By analogy with the simple random walk we shall call $x_j$ the
probability of absorption in $C$. The difference $1 - x_j$ accounts for
the possibility of absorption in other closed sets and (in the case of
some infinite chains) of an indefinite continuation in transient states.

It is clear that

$$
x_j^{(1)} = \sum_{C} p_{jk}, \tag{8.4}
$$

the summation extending over those $k$ for which $E_k$ is contained in $C$.
If the system reaches $C$ at the $(n+1)$st step, then the first step must
lead from $E_j$ to a transient state. It is therefore clear that

$$
x_j^{(n+1)} = \sum_{T} p_{j\nu}\, x_\nu^{(n)}, \tag{8.5}
$$

the summation now extending over those $\nu$ for which $E_\nu$ is
transient. Equations (8.4) and (8.5) are recurrence relations which
uniquely determine the $x_j^{(n)}$. Adding (8.5) for $n = 1, 2, 3, \ldots$,
we find that the absorption probabilities $x_j$ are solutions of the
system of linear equations

$$
x_j - \sum_{T} p_{j\nu}\, x_\nu = x_j^{(1)}. \tag{8.6}
$$

We have thus an answer to problem (a); the probability $x_j$ can be
obtained constructively from (8.3)-(8.5), but it is preferable to
characterize it as a solution of the system of linear equations (8.6). In
this connection the problem of uniqueness arises, but it fortunately
turns out to be a special case of problem (b).

Let $y_j^{(n)}$ be the probability that, starting from $E_j$, the system
is at time $n$ in a transient state. Obviously

$$
y_j^{(1)} = \sum_{T} p_{j\nu}, \qquad
y_j^{(n+1)} = \sum_{T} p_{j\nu}\, y_\nu^{(n)}, \tag{8.7}
$$

the summations again extending over all $\nu$ for which $E_\nu$ is
transient. It follows from (8.7) that $y_j^{(1)} \le 1$ and hence
$y_j^{(2)} \le y_j^{(1)}$, and generally $y_j^{(n+1)} \le y_j^{(n)}$.
Therefore a limit

$$
y_j = \lim_{n \to \infty} y_j^{(n)} \tag{8.8}
$$

exists; $y_j$ is the probability of the system's forever staying in
transient states. From (8.7) we have

$$
y_j = \sum_{T} p_{j\nu}\, y_\nu. \tag{8.9}
$$

The probabilities $y_j$ are thus seen to satisfy equations (8.9), but this
does not solve the main question, namely, whether or not $y_j = 0$ for
all $j$. Suppose that there exists a bounded solution of the system (8.9),
say

$$
z_j = \sum_{T} p_{j\nu}\, z_\nu, \qquad |z_j| \le 1. \tag{8.10}
$$

A comparison of (8.10) and (8.7) shows then that $|z_j| \le y_j^{(1)}$ and
hence by induction $|z_j| \le y_j^{(n)}$ for all $n$. It follows that
$y_j = 0$ for all $j$ if, and only if, the system (8.10) has no non-zero
solution. Finally, if the linear equations (8.6) had two distinct
solutions, their difference would be a solution of the linear equations
in (8.10). We have thus

Theorem 1. The probabilities $x_j$ of problem (a) are a solution of the
linear equations (8.6). This solution is unique except when there exists a
state $E_j$ such that, starting from $E_j$, the system has a positive
probability $y_j$ of staying forever in the transient states. The
$\{y_j\}$ satisfy (8.9).


Note: We have seen that the probabilities $y_j$ may be characterized
as the maximal solution of (8.9) bounded by 1; a similar property
attaches to $\{x_j\}$.


Corollary. In a finite Markov chain the probability of the system's
staying forever in the transient states is zero. The probabilities $x_j$
of passing from a transient $E_j$ into a closed set $C$ are determined as
the unique solution of the linear equations (8.6).


Proof. We have to prove that the equations (8.9) admit of no non-zero
solution. Suppose the contrary and let $M > 0$ be the maximum of the
finitely many $y_j$. There is no loss of generality in ordering the states
so that the $y_j$ appear in decreasing order, say that
$y_1 = y_2 = \cdots = y_a = M > y_{a+1} \ge y_{a+2} \ge \cdots$. From
(8.9) we have then for $i \le a$

$$
M = \sum_{T} p_{i\nu}\, y_\nu
  = M \sum_{\nu=1}^{a} p_{i\nu} + \sum_{\nu \ge a+1} p_{i\nu}\, y_\nu
  \le M \sum_{T} p_{i\nu} \le M, \tag{8.11}
$$

and the equality sign can hold only if $p_{i\nu} = 0$ for each $\nu > a$.
In this case $E_1, \ldots, E_a$ form a closed set, and this is impossible
since a finite chain necessarily contains persistent states (corollary,
section 5).

Theorem 1 is used to calculate absorption probabilities, that is, the
probabilities of entering a given absorbing state.

Examples. (a) Random walk with absorbing barriers [example (2.0)].
Take for $C$ the absorbing state $E_0$. Then $x_1^{(1)} = q$ and
$x_j^{(1)} = 0$ if $j > 1$. The system (8.6) therefore reduces to

$$
\begin{aligned}
x_1 - p x_2 &= q, \\
x_j - q x_{j-1} - p x_{j+1} &= 0 \qquad (j = 2, 3, \ldots, a-2), \\
x_{a-1} - q x_{a-2} &= 0.
\end{aligned}
\tag{8.12}
$$

This is the same as the system XIV(2.1)-(2.2) and the solution is given
in XIV(2.4).
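
The linear system of this example is small enough to solve exactly by machine. The following Python sketch is an illustration only: `solve_linear` and `ruin_probabilities` are hypothetical helper names, the values $p = 2/3$ and $a = 5$ are invented, and the closed form used for comparison is the classical ruin formula $x_z = \{(q/p)^a - (q/p)^z\}/\{(q/p)^a - 1\}$ for $p \ne q$, which is quoted here as an assumption since chapter XIV is not reproduced in this section.

```python
from fractions import Fraction


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination in exact arithmetic."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                M[r] = [x - M[r][c] * y for x, y in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


def ruin_probabilities(p, a):
    """Solve (8.12): absorption probabilities at E_0 from E_1, ..., E_{a-1}."""
    q = 1 - p
    T = range(1, a)                               # transient states
    A = [[Fraction(int(j == k)) for k in T] for j in T]
    for j in T:
        if j - 1 >= 1:
            A[j - 1][j - 2] -= q                  # term -q x_{j-1}
        if j + 1 <= a - 1:
            A[j - 1][j] -= p                      # term -p x_{j+1}
    b = [q if j == 1 else Fraction(0) for j in T] # right sides x_j^(1)
    return solve_linear(A, b)


p, a = Fraction(2, 3), 5                          # hypothetical parameters
x = ruin_probabilities(p, a)
ratio = (1 - p) / p
classical = [(ratio ** a - ratio ** z) / (ratio ** a - 1) for z in range(1, a)]
```

With these parameters both `x` and `classical` equal 15/31, 7/31, 3/31, 1/31, so the unique solution guaranteed by the corollary coincides with the classical formula.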

(b) Sequential sampling [example (2.j)]. Again let $C$ be the state
$E_0$. Then $x_j^{(1)} = r_j$, and the equations (8.6) reduce to XIV(8.2)
(where $u_z$ stands for the present $x_j$; cf. also problem XIV, 4).

(c) Genetics [example (2.k)]. Here each of the two states $E_0$ and
$E_{2N}$ forms a closed set. Absorption in $E_0$ and in $E_{2N}$
signifies, respectively, that the population ultimately consists only of
aa- or only of AA-individuals. For the absorption in $E_0$ we have
$x_j^{(1)} = p_{j0} = (1 - j/2N)^{2N}$, and hence (8.6) assumes the form

$$
x_j - \sum_{\nu=1}^{2N-1} \binom{2N}{\nu}
\left(\frac{j}{2N}\right)^{\nu} \left(1 - \frac{j}{2N}\right)^{2N-\nu} x_\nu
= \left(1 - \frac{j}{2N}\right)^{2N}. \tag{8.13}
$$

It is plausible that at a moment when the A- and a-genes are in the
proportion $j : 2N - j$ their survival chances should be in the same
ratio. If this is true, the solution to (8.13) must be
$x_j = 1 - j(2N)^{-1}$. That these $x_j$ really satisfy (8.13) is easily
verified upon recognizing in (8.13) the terms of the binomial
distribution with mean $j$.
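
The verification mentioned in the last sentence can also be carried out numerically for small populations. The Python sketch below is an illustration; the function name `residual_8_13` is hypothetical, and it simply evaluates the difference of the two sides of (8.13) for the proposed solution $x_j = 1 - j/2N$ in exact arithmetic.

```python
from fractions import Fraction
from math import comb


def residual_8_13(N, j):
    """Left side minus right side of (8.13) for x_j = 1 - j/(2N)."""
    n = 2 * N
    p = Fraction(j, n)

    def x(k):
        return 1 - Fraction(k, n)

    # binomial terms with 2N trials and success probability j/(2N)
    total = sum(comb(n, v) * p ** v * (1 - p) ** (n - v) * x(v)
                for v in range(1, n))
    return x(j) - total - (1 - p) ** n


residuals = [residual_8_13(N, j) for N in (1, 2, 5) for j in range(1, 2 * N)]
```

All entries of `residuals` are exactly zero, which is the statement that the binomial distribution with $2N$ trials and mean $j$ reproduces $1 - j/2N$ when averaged against $x_\nu$.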


Finally we give a solution to problem (c). 


Theorem 2. Let an irreducible chain have states $E_0, E_1, \ldots$. In
order that the states be transient, it is necessary and sufficient that
the system of equations

$$
y_i = \sum_{j=1}^{\infty} p_{ij}\, y_j, \qquad i = 1, 2, \ldots \tag{8.14}
$$

admits of a non-zero bounded solution.


Proof. In the construction (8.7)-(8.8) of $\{y_i\}$ interpret T as the set
of the states $E_1, E_2, \ldots$ (the complement of $E_0$). The proof applies
without change to this case, and it is seen that the probability of staying
in T (i.e., of not entering $E_0$) is given by (8.8) and satisfies (8.9).

Examples. (d) Unrestricted random walk. Example (2.e) requires
a trivial change of notations since the states are numbered from $-\infty$
to $+\infty$. It is clear, however, that our criterion depends on the existence
of solutions of the equations

$$
y_i = p y_{i+1} + q y_{i-1}, \qquad y_0 = 0, \qquad i = \pm 1, \pm 2, \ldots.
\tag{8.15}
$$

Clearly all $y_i$ can be calculated recursively from $y_1$ and $y_{-1}$. If $p > q$,

$$
y_i = \frac{1 - (q/p)^i}{1 - q/p}\, y_1, \qquad y_{-i} = 0, \qquad i = 1, 2, \ldots
\tag{8.16}
$$

is, for given $y_1$, the unique bounded solution. If $p = q$, the solution is
$y_i = i y_1$ and is unbounded. We have here a Markov chain derivation
of the old result that the states are transient if $p \ne q$, persistent if
$p = q$.

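(e) Consider the matrix

$$
\begin{pmatrix}
q_0 & p_0 & 0 & 0 & 0 & \cdots \\
q_1 & 0 & p_1 & 0 & 0 & \cdots \\
0 & q_2 & 0 & p_2 & 0 & \cdots \\
0 & 0 & q_3 & 0 & p_3 & \cdots \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots
\end{pmatrix}
$$

which represents a random walk on $(0, \infty)$ with variable transition
probabilities. It plays an important role in the theory of birth-and-death
processes to be discussed in chapter XVII. The equations (8.14)
reduce to

$$
y_1 = p_1 y_2, \qquad y_i = q_i y_{i-1} + p_i y_{i+1}, \qquad i = 2, 3, \ldots
\tag{8.17}
$$

and can be solved recursively since

$$
y_{i+1} - y_i = \frac{q_i}{p_i} (y_i - y_{i-1})
\tag{8.18}
$$

and hence

$$
y_{i+1} - y_i = y_1\, \frac{q_1 q_2 \cdots q_i}{p_1 p_2 \cdots p_i}.
\tag{8.19}
$$

Adding these equations, we see that a bounded solution exists if, and
only if, $\sum L_i < \infty$ where $L_i = (q_1 \cdots q_i)(p_1 \cdots p_i)^{-1}$. Therefore, the
states are transient if $\sum L_i < \infty$ and persistent otherwise.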
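
The criterion is easy to try out numerically. The sketch below builds the
ratios $L_i = (q_1 \cdots q_i)/(p_1 \cdots p_i)$ and the solution obtained by
adding the differences $y_{i+1} - y_i = y_1 L_i$; the two choices of $p_i$
(constant 2/3 and constant 1/2) and all function names are assumptions made
for the illustration.

```python
from fractions import Fraction

def ratios(p, n):
    """L_1, ..., L_n with L_i = (q_1 ... q_i)/(p_1 ... p_i) and q_i = 1 - p(i)."""
    out, prod = [], Fraction(1)
    for i in range(1, n + 1):
        prod *= (1 - p(i)) / p(i)
        out.append(prod)
    return out

def solution(p, n, y1=Fraction(1)):
    """[y_1, ..., y_n], obtained from y_{i+1} = y_i + y_1 L_i."""
    ys = [y1]
    for L in ratios(p, n - 1):
        ys.append(ys[-1] + y1 * L)
    return ys

def drift_up(i):      # hypothetical p_i = 2/3: the sum of the L_i converges
    return Fraction(2, 3)

def fair(i):          # hypothetical p_i = 1/2: every L_i equals 1
    return Fraction(1, 2)

ys = solution(drift_up, 30)              # ys[k - 1] holds y_k
# Check the recursion y_i = q_i y_{i-1} + p_i y_{i+1} for i = 2, ..., 29.
recursion_ok = all(
    ys[i - 1] == (1 - drift_up(i)) * ys[i - 2] + drift_up(i) * ys[i]
    for i in range(2, 30)
)
print(recursion_ok, float(max(ys)), solution(fair, 30)[-1])
```

With $p_i = 2/3$ the solution stays below 2 (transient states); with
$p_i = 1/2$ it grows like i, so no bounded solution exists (persistent states).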
9. APPLICATION TO CARD SHUFFLING


A deck of N cards numbered 1, 2, ..., N can be arranged in N! different
orders, and each represents a possible state of the system. Every
particular shuffling operation effects a transition from the existing
state into some other state. For example, “cutting” will change the
order (1, 2, ..., N) into one of the N cyclically equivalent orders
(r, r+1, ..., N, 1, 2, ..., r-1). The same operation applied to the
inverse order (N, N-1, ..., 1) will produce (N-r+1, N-r, ..., 1, N,
N-1, ..., N-r+2). In other words, we conceive of each particular
shuffling operation as a transformation $E_j \to E_k$. If exactly the
same operation is repeated, the system will pass (starting from the
given state $E_j$) through a well-defined succession of states, and after
a finite number of steps the original order will be re-established. From
then on the same succession of states will recur periodically. For most
operations the period will be rather small, and in no case can all states
be reached by this procedure.8 For example, a perfect “lacing” would
change a deck of 2m cards from (1, ..., 2m) into (1, m+1, 2, m+2, ...,
m, 2m). With six cards four applications of this operation will re-establish
the original order. With ten cards the initial order will reappear
after six operations, so that repeated perfect lacing of a deck of ten
cards can produce only six out of the 10! = 3,628,800 possible orders.
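
The two periods quoted for perfect lacing can be confirmed by direct
repetition of the operation. The following is a small sketch; the function
names are chosen here for the illustration.

```python
def lace(deck):
    """Perfect lacing: (1, ..., 2m) -> (1, m+1, 2, m+2, ..., m, 2m)."""
    m = len(deck) // 2
    out = []
    for upper, lower in zip(deck[:m], deck[m:]):
        out.extend([upper, lower])
    return out

def period(n_cards):
    """Number of perfect lacings that restore a deck of n_cards (even) cards."""
    start = list(range(1, n_cards + 1))
    deck, steps = lace(start), 1
    while deck != start:
        deck, steps = lace(deck), steps + 1
    return steps

print(period(6), period(10))  # 4 6
```

Only `period(n)` different orders are ever visited, however long the
operation is repeated.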

In practice the player may wish to vary the operation, and at any 
rate accidental variations will be introduced by chance. We shall 
assume that we can account for the player’s habits and the influence 
of chance variations by assuming that every particular operation has 
a certain probability (possibly zero). We need assume nothing about 
the numerical values of these probabilities but shall suppose that the 
player operates without regard to the past and does not know the order 
of the cards.9 This implies that the successive operations correspond
to independent trials with fixed probabilities; for the actual deck of 
cards we then have a Markov chain. 

We now show that the matrix P of transition probabilities is doubly
stochastic [example (6.b)]. In fact, if an operation changes a state
(order of cards) $E_j$ to $E_k$, then there exists another state $E_r$ which it
will change into $E_j$. This means that the elements of the jth column
of P are identical with the elements of the jth row, except that they
appear in a different order. All column sums are therefore unity.


8 In the language of group theory this amounts to saying that the permutation
group is not cyclic and can therefore not be generated by a single operation.

9 This assumption corresponds to the usual situation at bridge. It is easy to
devise more complicated shuffling techniques in which the operations depend on
previous operations and the final outcome is not a Markov chain [cf. example
(10.e)].

It follows that no state can be transient. If the chain is irreducible
and aperiodic, then in the limit all states become equally probable. In
other words, any kind of shuffling will do, provided only that it produces
an irreducible and aperiodic chain. It is safe to assume that this
is usually the case. Suppose, however, that the deck contains an even
number of cards and the procedure consists in dividing them equally
into two parts and shuffling them separately by any method. If the
two parts are put together in their original order, then the Markov
chain is reducible (since not every state can be reached from every
other state). If the order of the two parts is inverted, the chain will
have period 2. Thus both contingencies can arise in theory, but hardly
in practice, since chance precludes perfect regularity.

It is seen that continued shuffling may reasonably be expected to
produce perfect “randomness” and to eliminate all traces of the original
order. It should be noted, however, that the number of operations
required for this purpose is extremely large.10
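
The argument can be tried out on a toy deck. The sketch below assumes a
hypothetical three-card deck and a hypothetical player who leaves the deck
alone, cuts off the top card, or interchanges the two top cards with
probabilities 1/4, 1/2, 1/4; none of these numbers comes from the text. It
builds the matrix P on the 3! = 6 orders, checks that every column sum is
unity, and follows the distribution through 60 shuffles.

```python
from fractions import Fraction
from itertools import permutations

def shuffle_once(op, order):
    """Move the card in place i to place op[i] (places numbered from 0)."""
    out = [None] * len(order)
    for place, card in enumerate(order):
        out[op[place]] = card
    return tuple(out)

def transition_matrix(ops):
    """Chain on all orders of the deck; ops maps an operation to its probability."""
    n = len(next(iter(ops)))
    states = list(permutations(range(1, n + 1)))
    index = {state: k for k, state in enumerate(states)}
    P = [[Fraction(0)] * len(states) for _ in states]
    for state in states:
        for op, prob in ops.items():
            P[index[state]][index[shuffle_once(op, state)]] += prob
    return states, P

ops = {                                  # hypothetical habits of the player
    (0, 1, 2): Fraction(1, 4),           # leave the deck alone
    (2, 0, 1): Fraction(1, 2),           # cut: the top card goes to the bottom
    (1, 0, 2): Fraction(1, 4),           # interchange the two top cards
}
states, P = transition_matrix(ops)
column_sums = [sum(row[k] for row in P) for k in range(len(states))]

dist = [1.0 if s == (1, 2, 3) else 0.0 for s in states]   # known initial order
for _ in range(60):
    dist = [sum(dist[j] * float(P[j][k]) for j in range(len(states)))
            for k in range(len(states))]
print(column_sums == [1] * 6, max(abs(d - 1 / 6) for d in dist) < 1e-9)
```

Each fixed operation is a one-to-one mapping of the orders, which is why
every column receives each operation's probability exactly once.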


10. THE GENERAL MARKOV PROCESS 


In applications it is usually convenient to describe Markov chains
in terms of random variables. This can be done by the simple device
of replacing in the preceding sections the symbol $E_k$ by the integer k.
The state of the system at time n then is a random variable $X^{(n)}$, which
assumes the value k with probability $a_k^{(n)}$; the joint distribution of $X^{(n)}$
and $X^{(n+1)}$ is given by $P\{X^{(n)} = j, X^{(n+1)} = k\} = a_j^{(n)} p_{jk}$, and the
joint distribution of $(X^{(0)}, \ldots, X^{(n)})$ is given by (1.1). It is also possible
and sometimes preferable to assign to $E_k$ a numerical value $e_k$ different
from k. With this notation a Markov chain becomes a special stochastic
process,11 or in other words, a sequence of (dependent) random
variables 12 $(X^{(0)}, X^{(1)}, \ldots)$. The superscript n plays the role of
time. In chapter XVII we shall get a glimpse of more general stochastic
processes in which the time parameter is permitted to vary continuously.
The term “Markov process” is applied to a very large and important
class of stochastic processes (with both discrete and continuous
time parameters). Even in the discrete case there exist more general
Markov processes than the simple chains we have studied so far. It
will, therefore, be useful to give a definition of the Markov property,
to point out the special condition characterizing our Markov chains,
and, finally, to give a few examples of non-Markovian processes.


10 For an analysis of unbelievably poor results of shuffling in records of
extrasensory perception experiments, see W. Feller, Statistical aspects of ESP, Journal
of Parapsychology, vol. 4 (1940), pp. 271-298. In their amusing A review of Dr.
Feller’s critique, ibid., pp. 299-319, J. A. Greenwood and C. E. Stuart try to show
that these results are due to chance. Both their arithmetic and their experiments
have a distinct tinge of the supernatural.

11 The terms “stochastic process” and “random process” are synonyms and cover
practically all the theory of probability from coin tossing to harmonic analysis.
In practice, the term “stochastic process” is used mostly when a time parameter
is introduced.

12 This formulation refers to an infinite product space, but in reality we are
concerned only with joint distributions of finite collections of the variables.

Conceptually, a Markov process is the probabilistic analogue of the 
processes of classical mechanics, where the future development is com- 
pletely determined by the present state and is independent of the way 
in which the present state has developed. The processes of mechanics 
are in contrast to processes with aftereffect (or hereditary processes), 
such as occur in the theory of plasticity, where the whole past history 
of the system influences its future. In stochastic processes the future 
is never uniquely determined, but we have at least probability relations 
enabling us to make predictions. For the Markov chains studied in 
this chapter it is clear that probability relations relating to the future 
depend on the present state, but not on the manner in which the pres- 
ent state has emerged from the past. In other words, if two independ- 
ent systems subject to the same transition probabilities happen to be 
in the same state, then all probabilities relating to their future develop- 
ments are identical. This is a rather vague description which is for- 
malized in the following 


Definition. A sequence of discrete-valued random variables is a
Markov process if, for every finite collection of integers
$n_1 < n_2 < \ldots < n_r < n$, the joint distribution of
$(X^{(n_1)}, X^{(n_2)}, \ldots, X^{(n_r)}, X^{(n)})$ is defined
in such a way that the conditional probability of the relation $X^{(n)} = x$ on
the hypothesis $X^{(n_1)} = x_1, \ldots, X^{(n_r)} = x_r$ is identical with the conditional
probability of $X^{(n)} = x$ on the single hypothesis $X^{(n_r)} = x_r$. Here
$x_1, \ldots, x_r$ are arbitrary numbers for which the hypothesis has a positive
probability.


Reduced to simpler terms, this definition states that, given the state
$x_r$ at time $n_r$, no additional data concerning states of the system at
previous times can alter the (conditional) probability of the state x at
a future time n.

The Markov chains studied in this chapter are obviously Markov
processes, but they have the following additional property not implied
by the definition. For the Markov chains studied in the preceding sections
the transition probabilities $p_{jk} = P\{X^{(m+1)} = k \mid X^{(m)} = j\}$ are
independent of m. The more general transition probabilities

$$
p_{jk}^{(n-m)} = P\{X^{(n)} = k \mid X^{(m)} = j\} \qquad (m < n)
\tag{10.1}
$$

then depend only on the difference n - m. We say in this case that
the transition probabilities are stationary (or constant). For a general
integral-valued Markov chain the right side in (10.1) depends on m
and n. We shall denote it by $p_{jk}(m, n)$ so that $p_{jk}(n, n+1)$ is the
one-step transition probability at time n. Instead of (1.1) we get now for
the probability of the path $(j_0, j_1, \ldots, j_n)$ the expression

$$
a_{j_0}\, p_{j_0 j_1}(0, 1)\, p_{j_1 j_2}(1, 2) \cdots p_{j_{n-1} j_n}(n-1, n).
\tag{10.2}
$$

The proper generalization of (3.3) is obviously the identity

$$
p_{jk}(m, n) = \sum_{\nu} p_{j\nu}(m, r)\, p_{\nu k}(r, n)
\tag{10.3}
$$

which is valid for all r with m < r < n. This identity follows directly
from the definition of a Markov process and also from (10.2); it is
called the Chapman-Kolmogorov equation.
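
A small numerical illustration may help fix the notation. The sketch below
assumes a hypothetical two-state process with three different one-step
matrices $p_{jk}(n, n+1)$, n = 0, 1, 2, and a hypothetical initial
distribution; it forms $p_{jk}(m, n)$ as a product of one-step matrices and
compares it with the sum of the path probabilities (10.2).

```python
from fractions import Fraction as F
from itertools import product

def matmul(A, B):
    return [[sum(A[i][v] * B[v][k] for v in range(len(B)))
             for k in range(len(B[0]))] for i in range(len(A))]

# Hypothetical one-step matrices p_jk(n, n+1) for n = 0, 1, 2.
one_step = [
    [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]],
    [[F(1, 3), F(2, 3)], [F(1), F(0)]],
    [[F(3, 5), F(2, 5)], [F(1, 2), F(1, 2)]],
]
a = [F(1, 3), F(2, 3)]                   # hypothetical initial distribution

def transition(m, n):
    """Matrix of the p_jk(m, n), m < n, as a product of one-step matrices."""
    M = one_step[m]
    for t in range(m + 1, n):
        M = matmul(M, one_step[t])
    return M

def path_probability(path):
    """Expression (10.2) for the path (j_0, j_1, ..., j_n)."""
    prob = a[path[0]]
    for t in range(len(path) - 1):
        prob *= one_step[t][path[t]][path[t + 1]]
    return prob

# (10.3) with m = 0, n = 3 and both admissible intermediate times r.
ck_holds = all(transition(0, 3) == matmul(transition(0, r), transition(r, 3))
               for r in (1, 2))
# Summing (10.2) over all paths that end in k gives the distribution at time 3.
by_paths = [sum(path_probability(p) for p in product(range(2), repeat=4) if p[-1] == k)
            for k in range(2)]
by_matrix = [sum(a[j] * transition(0, 3)[j][k] for j in range(2)) for k in range(2)]
print(ck_holds, by_paths == by_matrix)   # True True
```

Here the transition matrices depend on m and n separately, not merely on
the difference n - m.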

In the present chapter we have dealt mostly with the asymptotic 
behavior of the higher transition probabilities, and few of the estab- 
lished properties are common to the most general discrete Markov 
process. We shall, therefore, not dwell on the general theory. 


Examples of Non-Markovian Processes. (a) The Polya urn
scheme [example V(2.c)]. Let $X^{(n)}$ equal 1 or 0 according to whether
the nth drawing results in a black or red ball. The sequence $\{X^{(n)}\}$ is
not a Markov process. For example,

$$
P\{X^{(3)} = 1 \mid X^{(2)} = 1\} = (b + c)/(b + r + c),
$$

but

$$
P\{X^{(3)} = 1 \mid X^{(2)} = 1, X^{(1)} = 1\} = (b + 2c)/(b + r + 2c).
$$

(Cf. problems V, 19-20.) On the other hand, if $Y^{(n)}$ is the number of
black balls in the urn at time n, then $\{Y^{(n)}\}$ is an ordinary Markov
chain with constant transition probabilities.
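
The two conditional probabilities can be confirmed by enumerating all
sequences of three drawings. The sketch assumes the usual Polya scheme (the
ball drawn is replaced and c balls of its color are added) and the
hypothetical values b = 2, r = 3, c = 1; the function names are chosen for
the illustration.

```python
from fractions import Fraction
from itertools import product

def path_probability(colors, b, r, c):
    """Probability of a sequence of drawings (1 = black, 0 = red)."""
    prob = Fraction(1)
    for x in colors:
        prob *= Fraction(b if x else r, b + r)
        if x:
            b += c          # c further black balls are added
        else:
            r += c          # c further red balls are added
    return prob

def conditional(event, given, b, r, c, n=3):
    """P{event | given}; x[0] is the first drawing, x[n-1] the nth."""
    num = den = Fraction(0)
    for x in product((0, 1), repeat=n):
        p = path_probability(x, b, r, c)
        if given(x):
            den += p
            if event(x):
                num += p
    return num / den

b, r, c = 2, 3, 1                        # hypothetical composition of the urn
one_back = conditional(lambda x: x[2] == 1, lambda x: x[1] == 1, b, r, c)
two_back = conditional(lambda x: x[2] == 1, lambda x: x[1] == 1 and x[0] == 1, b, r, c)
print(one_back, two_back)  # 1/2 4/7
```

The values agree with (b + c)/(b + r + c) = 3/6 and (b + 2c)/(b + r + 2c) = 4/7.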

(b) Higher sums. Let $Y_0, Y_1, \ldots$ be mutually independent random
variables, and put $S_n = Y_0 + \ldots + Y_n$. The difference $S_n - S_m$ (with
m < n) depends only on $Y_{m+1}, \ldots, Y_n$, and it is therefore easily seen
that the sequence $\{S_n\}$ is a Markov process. Now let us go one step
further and define a new sequence of random variables $U_n$ by
$U_n = S_0 + S_1 + \ldots + S_n$ (which means that

$$
U_n = Y_n + 2Y_{n-1} + 3Y_{n-2} + \ldots + (n+1) Y_0).
$$

The sequence $\{U_n\}$ forms a stochastic process whose probability relations
can, in principle, be expressed in terms of the distributions of the
$Y_k$. The $\{U_n\}$ process is in general not of the Markov type, since there
is no reason why, for example, $P\{U_n = 0 \mid U_{n-1} = a\}$ should be the
same as $P\{U_n = 0 \mid U_{n-1} = a, U_{n-2} = b\}$; the knowledge of $U_{n-1}$ and
$U_{n-2}$ permits better predictions than the sole knowledge of $U_{n-1}$.

In the case of a continuous time parameter the preceding summations
are replaced by integrations. In diffusion theory the $Y_n$ play the role
of accelerations; the $S_n$ are then velocities, and the $U_n$ positions. If
only positions can be measured, we are compelled to study a non-Markovian
process, even though it is indirectly defined in terms of a
Markov process.

(c) Moving averages. Again let $\{Y_n\}$ be a sequence of mutually
independent random variables. Moving averages of order r are defined
by $X^{(n)} = (Y_n + Y_{n+1} + \ldots + Y_{n+r-1})/r$. It is easily seen that the
$X^{(n)}$ are not a Markov process. Processes of this type are common in
many applications (cf. problem 26).
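
For the simplest case, r = 2 with each $Y_n$ equal to +1 or -1 with
probability 1/2 (the setting of problem 26), the failure of the Markov
property can be seen by enumeration. The sketch below is an illustration
under exactly these assumptions; the function name is chosen here.

```python
from fractions import Fraction
from itertools import product

def conditional(event, given):
    """P{event | given} when Y_0, ..., Y_3 are independent and equal +1 or -1
    with probability 1/2 each, so that all 16 sign patterns are equally likely."""
    favourable = total = 0
    for y in product((-1, 1), repeat=4):
        # X^(n) = (Y_n + Y_{n+1})/2 for n = 0, 1, 2; the sums are -2, 0 or 2.
        x = [(y[n] + y[n + 1]) // 2 for n in range(3)]
        if given(x):
            total += 1
            favourable += event(x)
    return Fraction(favourable, total)

last_only = conditional(lambda x: x[2] == 1, lambda x: x[1] == 0)
with_past = conditional(lambda x: x[2] == 1, lambda x: x[1] == 0 and x[0] == 1)
print(last_only, with_past)  # 1/4 0
```

Knowing that $X^{(0)} = 1$ and $X^{(1)} = 0$ forces $Y_2 = -1$, so that
$X^{(2)} = 1$ becomes impossible; the value $X^{(1)} = 0$ alone leaves it
possible.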

(d) A traffic problem. For an empirical example of a non-Markovian
process, R. Fürth 13 made extensive observations on the number of
pedestrians on a certain segment of a street. An idealized mathematical
model of this process can be obtained in the following way. For
simplicity we assume that all pedestrians have the same speed v; also,
we consider only pedestrians moving in one direction. At time t = 0
we divide the positive x-axis into segments of fixed length $\delta$, each of
which may or may not contain a pedestrian. We suppose that the
distribution of pedestrians in our segments is determined by a sequence
of Bernoulli trials. In other words, we have a sequence of independent
random variables $Y_k$, each of which assumes the values 1 or 0 with
probabilities p and q, respectively. The segment $(k - 1)\delta < x < k\delta$
contains a pedestrian if $Y_k = 1$. Let now the whole axis move with
velocity v in the negative direction, and let us observe the number of
pedestrians in the fixed interval of length $N\delta$, which at time t = 0 is
covered by the interval $0 < x < N\delta$ of the moving x-axis. At time t
this fixed interval is covered by the interval $vt < x < vt + N\delta$ of the
x-axis. Let observations be made at times $n\delta/v$ and let $X^{(n)}$ be the
number of pedestrians in our fixed interval observed at time n. Then
$X^{(n)} = Y_n + Y_{n+1} + \ldots + Y_{n+N-1}$, so that our process is, except for
the factor 1/N, a moving average process. It is therefore non-Markovian.
(Passing to the limit $\delta \to 0$, we obtain a continuous model,
in which a Poisson distribution takes over the role of the binomial
distribution.)


13 R. Fürth, Schwankungserscheinungen in der Physik, Sammlung Vieweg,
Braunschweig, 1920, pp. 17ff. The original observations appeared in Physikalische
Zeitschrift, vols. 19 (1918) and 20 (1919).

(e) Superposition of Markov processes (composite shuffling). There
exist many technical devices (such as groups of selectors in telephone
exchanges, counters, filters) whose action can be described as a superposition
of two Markov processes with an output which is non-Markovian.
A fair idea of such mechanisms may be obtained from the study
of the following method of card shuffling.

In addition to the target deck of N cards we have an equivalent
auxiliary deck, and the usual shuffling technique is applied to this auxiliary
deck. If its cards appear in the order $(a_1, a_2, \ldots, a_N)$, we permute
the cards of the target deck so that the first, second, ..., Nth
cards are transferred to the places number $a_1, a_2, \ldots, a_N$. Thus the


shuffling of the auxiliary deck indirectly determines the successive 
orderings of the target deck. The latter form a stochastic process which 
is not of the Markov type. To prove this, it suffices to show that the 
knowledge of two successive orderings of the target deck conveys in 
general more clues to the future than the sole knowledge of the last 
ordering. We show this in a simple special case. 

Let N = 4, and suppose that the auxiliary deck is initially in the
order (2431). Suppose, furthermore, that the shuffling operation
always consists of a true “cutting,” that is, the ordering $(a_1, a_2, a_3, a_4)$
is changed into one of the three orderings $(a_2, a_3, a_4, a_1)$, $(a_3, a_4, a_1, a_2)$,
$(a_4, a_1, a_2, a_3)$; we attribute to each of these three possibilities
probability 1/3. With these conventions the auxiliary deck will at any time
be in one of the four orderings (2431), (4312), (3124), (1243). On the
other hand, a little experimentation will show that the target deck will
gradually pass through all 24 possible orderings and that each of them
will appear in combination with each of the four possible orderings of
the auxiliary deck. This means that the ordering (1234) of the target
deck will recur infinitely often, and it will always be succeeded by one
of the four orderings (2431), (4312), (3124), (1243). Now the auxiliary
deck can never remain in the same ordering, and hence the target deck
cannot twice in succession undergo the same permutation. Hence, if
at times n - 1 and n the orderings are (1234) and (1243), respectively,
then at time n + 1 the state (1234) is impossible. Thus the knowledge
of the state at times n - 1 and n conveys more information than the
sole knowledge of the state at time n.


*11. MISCELLANY 


(a) Inverse Probabilities 


Although it is most natural to investigate the future development of
a system, it is occasionally necessary to study its past. Consider a
Markov chain with states $E_k$ and constant transition probabilities $p_{jk}$,
whose absolute probabilities at time n are $a_k^{(n)} = \sum_\nu a_\nu p_{\nu k}^{(n)}$. The
conditional probability that the system was at time m < n in state $E_j$, given
that at time n it is in $E_k$, is (independently of the states at times after n)

$$
q_{kj}(n, m) = \frac{a_j^{(m)}\, p_{jk}^{(n-m)}}{a_k^{(n)}}, \qquad m < n.
\tag{11.1}
$$

This formula makes sense only if $a_k^{(n)} > 0$; otherwise the conditional
probability in question is not defined. If all $a_k^{(n)}$ are positive, then
(11.1) defines a system of transition probabilities with all the properties
required for a Markov process. In particular, the $q_{kj}(n, m)$ satisfy the
Chapman-Kolmogorov identity (10.3) with the time direction reversed,
namely,

$$
q_{kj}(n, m) = \sum_{\nu} q_{k\nu}(n, r)\, q_{\nu j}(r, m)
\tag{11.2}
$$

(m < r < n). The $q_{kj}(n, m)$ are called inverse probabilities. Consider,
in particular, an irreducible chain with stationary probabilities $\{u_k\}$.
Then $a_k^{(n)} = u_k$ for all n, and $u_k > 0$ (cf. sections 6 and 7). In this
case the one-step transitions $q_{kj}(n+1, n)$ are independent of n and
reduce to

$$
q_{kj} = \frac{u_j}{u_k}\, p_{jk}.
\tag{11.3}
$$

The matrix $\{q_{kj}\}$ is stochastic, so that here the inverse probabilities
define a Markov chain with constant transition probabilities. If
$q_{jk} = p_{jk}$, the original chain is called reversible; its probability relations
are then symmetric in time.
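
The passage from a chain with stationary probabilities $u_k$ to the matrix
of inverse probabilities $u_j p_{jk}/u_k$ is a one-line computation. The
sketch below applies it to two hypothetical three-state chains (neither is
taken from the text): one is reversible, the other is not.

```python
from fractions import Fraction as F

def is_stationary(u, P):
    """True if u is a probability distribution with u = uP."""
    n = len(P)
    return sum(u) == 1 and all(
        sum(u[j] * P[j][k] for j in range(n)) == u[k] for k in range(n))

def inverse_chain(P, u):
    """Matrix with entry u_j p_jk / u_k in row k, column j."""
    n = len(P)
    return [[u[j] * P[j][k] / u[k] for j in range(n)] for k in range(n)]

# Hypothetical chain moving only between neighbouring states.
ladder = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 4), F(1, 2), F(1, 4)],
          [F(0), F(1, 2), F(1, 2)]]
ladder_u = [F(1, 4), F(1, 2), F(1, 4)]

# Hypothetical chain with a preferred direction of rotation.
cycle = [[F(0), F(2, 3), F(1, 3)],
         [F(1, 3), F(0), F(2, 3)],
         [F(2, 3), F(1, 3), F(0)]]
cycle_u = [F(1, 3)] * 3

print(is_stationary(ladder_u, ladder), inverse_chain(ladder, ladder_u) == ladder)  # True True
print(is_stationary(cycle_u, cycle), inverse_chain(cycle, cycle_u) == cycle)       # True False
```

For the second chain the inverse probabilities describe the rotation in the
opposite sense, so that observing the process backward in time is
distinguishable from observing it forward.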


* This section may be omitted at first reading.

14 A. Kolmogoroff, Zur Theorie der Markoffschen Ketten, Mathematische Annalen,
vol. 112 (1935), pp. 155-160.


(b) The Central Limit Theorem


The theory of recurrent events contains further information concerning
Markov chains. Let $E_k$ be a fixed persistent state whose recurrence
time has finite variance $\sigma_k^2$ (this condition is always satisfied if the
chain is finite). Let $N_n$ denote the number of passages up to time n
of the system through $E_k$. Then we know from chapter XIII, section
6, that the variable $N_n$ is asymptotically normally distributed. In the
notations of the present chapter we have $E(N_n) \sim n/\mu_k = n u_k$; a way
to calculate the variance in the case of finite chains will be indicated
in the next chapter. In particular, as $n \to \infty$, the probability tends
to one that $\left| \frac{N_n}{n} - u_k \right| < \epsilon$ for every arbitrary $\epsilon > 0$. This is the weak
law of large numbers for the number of passages through $E_k$. Similarly,
the strong law of large numbers and the law of the iterated logarithm
hold and require no special proof. In the case of an infinite chain, the
recurrence time of $E_k$ need not have a finite variance, even if its mean
is finite. However, the general limit theorems for recurrent events
apply in this case.

The random variable $N_n$ may be defined by $N_n = X_1 + \ldots + X_n$,
where $X_n$ equals one if the system is, at time n, in state $E_k$, and zero
otherwise. This suggests the following generalization. We assign to
the state $E_k$ an arbitrary number $x_k$ and let the random variable $X_n$
equal $x_k$ if at time n the system is in state $E_k$. As usual, we put
$S_n = X_1 + \ldots + X_n$. For finite Markov chains Doeblin 15 has shown
that in general the central limit theorem and the law of the iterated
logarithm hold for $S_n$. An exception occurs only if the numbers $x_k$ are
chosen so that for every shortest path leading from $E_j$ back to $E_j$ the
sum of the $x_k$ equals a constant c independent of the path.


(c) Non-stochastic Matrices 


The theorems of this chapter describe the asymptotic behavior of
the powers $P^n$ of an arbitrary stochastic matrix P, that is, of a matrix
whose elements satisfy the conditions (1.2). It is easy to generalize
these theorems to a more general class of matrices. Let P be an arbitrary
(finite or infinite) matrix with non-negative elements and denote its row
sums by $S_j$ so that $S_j = \sum_k p_{jk}$. We assume that the sequence $S_j$ is bounded,
that is, that there exists a constant M such that $S_j \le M$. Under these
conditions the asymptotic behavior of $P^n$ is still described by our theorems,
inasmuch as P can be reduced to a stochastic matrix.

To fix ideas suppose that the rows and columns of P are numbered
starting with 1, and consider first the case where $S_j \le 1$ for all j. In
this case we enlarge (border) the matrix P by adding a row and a column
number zero whose elements are defined by $p_{00} = 1$, $p_{01} = p_{02} =
\ldots = 0$, and $p_{j0} = 1 - S_j$ for $j \ge 1$. The new matrix Q is stochastic,
and its asymptotic behavior is given by our theorems. On the
other hand, $P^n$ is the submatrix of the corner element $p_{00}^{(n)}$ of $Q^n$. In
the general case the row sums $S_j$ may exceed unity, but we may replace
the matrix P by the matrix $P^*$ whose elements are $p_{jk}/M$. The row
sums $S_j^*$ of $P^*$ satisfy the condition $S_j^* \le 1$, and we are able to describe
the asymptotic behavior of the powers $P^{*n}$. However, the matrices
$P^n$ and $P^{*n}$ differ only by the factor $M^n$, so that our theorems
actually describe the asymptotic behavior of $p_{jk}^{(n)}$ in all cases.


15 W. Doeblin, Sur les propriétés asymptotiques de mouvements régis par certains
types de chaînes simples, Thesis, Paris, 1937.

Matrices of the described type occur in the theory of generalized 
random walks with creation or destruction of masses. 
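
The reduction to a stochastic matrix can be followed on a small example.
The sketch below assumes a hypothetical 2 by 2 matrix P with row sums 3/2
and 2, takes M = 2, borders the rescaled matrix with a row and column
number zero, and checks that deleting that row and column from the powers
of the bordered matrix Q returns the powers of the rescaled matrix. All
names are chosen for the illustration.

```python
from fractions import Fraction as F

def matmul(A, B):
    return [[sum(A[i][v] * B[v][k] for v in range(len(B)))
             for k in range(len(B[0]))] for i in range(len(A))]

def power(A, n):
    """A^n for n >= 1 by repeated multiplication."""
    out = A
    for _ in range(n - 1):
        out = matmul(out, A)
    return out

def border(P):
    """Add row and column number zero: p_00 = 1, p_0k = 0, p_j0 = 1 - S_j."""
    Q = [[F(1)] + [F(0)] * len(P)]
    for row in P:
        Q.append([1 - sum(row)] + list(row))
    return Q

def inner(Q):
    """Delete row and column number zero."""
    return [row[1:] for row in Q[1:]]

P = [[F(1), F(1, 2)], [F(1, 2), F(3, 2)]]         # hypothetical, row sums 3/2 and 2
M = 2
P_star = [[x / M for x in row] for row in P]       # row sums are now at most 1
Q = border(P_star)

stochastic = all(sum(row) == 1 and min(row) >= 0 for row in Q)
submatrix_ok = all(inner(power(Q, n)) == power(P_star, n) for n in range(1, 6))
scaling_ok = all([[M**n * x for x in row] for row in power(P_star, n)] == power(P, n)
                 for n in range(1, 6))
print(stochastic, submatrix_ok, scaling_ok)        # True True True
```

State zero acts as an absorbing state which collects the mass missing from
the rows of the rescaled matrix.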


(d) Literature 


There exists a huge literature on finite Markov chains. A detailed
account of the various methods of attack and references to earlier work
will be found in the comprehensive treatise by M. Fréchet.16 An algebraic
treatment of finite chains will be described in the next chapter.
The entire theory of finite chains can be derived from Frobenius’ theory
of matrices with positive elements. This method has been exploited
in particular by V. Romanovsky. Unfortunately these methods do
not carry over to the more interesting case of infinite chains, first
considered by A. Kolmogorov.17 His work was continued by W. Doeblin 18
and J. L. Doob.19 The latter derived the ergodic properties from general
group theory. Recent papers by K. L. Chung 20 investigate in
particular transitions from one state to another when certain states
are forbidden. This leads in turn to more elegant formulas for the
central limit theorem.

It has been shown by C. Derman 21 that in an irreducible chain with
null states the equations (6.1) for stationary distributions admit of a
unique solution $\{v_k\}$ such that $\sum v_k = \infty$. The inversion formula (11.3)
makes sense also for such solutions, and the modern theory pays increasing
attention to similar uses of unbounded solutions.22 Certain
very general classes of non-Markovian processes related to Markov
chains are treated systematically by T. E. Harris.23


16 Recherches théoriques modernes sur le calcul des probabilités, vol. 2 (théorie des
événements en chaîne dans le cas d’un nombre fini d’états possibles), Paris, 1938.
Another monograph on Markov chains is due to B. Hostinsky, Méthodes générales
du calcul des probabilités, fasc. 52 of the Mémorial des sciences mathématiques, Paris,
1931.

17 Anfangsgründe der Theorie der Markoffschen Ketten mit unendlich vielen
möglichen Zuständen, Matematičeskii Sbornik, N.S., vol. 1 (1936), pp. 607-610.
This paper contains no proofs. A complete exposition was given only in Russian,
in Bulletin de l’Université d’État à Moscou, Sect. A, vol. 1 (1937), pp. 1-15.

18 Sur deux problèmes de M. Kolmogoroff concernant les chaînes dénombrables,
Bulletin Société Mathématique de France, vol. 66 (1939), pp. 1-11.

19 Topics in the theory of Markoff chains, and also Markoff chains - denumerable
case, Transactions American Mathematical Society, vol. 52 (1942), pp. 37-64,
and vol. 58 (1945), pp. 455-473.

20 K. L. Chung, Contributions to the theory of Markov chains I, Journal of
Research, National Bureau of Standards, vol. 50 (1953), pp. 203-208, and II,
Transactions American Mathematical Society, vol. 76 (1954), pp. 397-419.


12. PROBLEMS FOR SOLUTION 


1. In a sequence of Bernoulli trials we say that at time n the state $E_1$ is
observed if the trials number n - 1 and n resulted in SS. Similarly $E_2$, $E_3$, $E_4$
stand for SF, FS, FF. Find the matrix P and all its powers. Generalize the
scheme.

2. Classify the states for the four chains whose matrices P have the rows
given below. Find in each case $P^2$ and the asymptotic behavior of $p_{jk}^{(n)}$.

(a) (0, 1/2, 1/2), (1/2, 0, 1/2), (1/2, 1/2, 0);

(b) (0, 0, 0, 1), (0, 0, 0, 1), (1/2, 1/2, 0, 0), (0, 0, 1, 0);

(c) (1/2, 0, 1/2, 0, 0), (1/4, 1/2, 1/4, 0, 0), (1/2, 0, 1/2, 0, 0),
(0, 0, 0, 1/2, 1/2), (0, 0, 0, 1/2, 1/2);

(d) (0, 1/2, 1/2, 0, 0, 0), (0, 0, 0, 1/3, 1/3, 1/3), (0, 0, 0, 1/3, 1/3, 1/3),
(1, 0, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0).

3. We consider throws of a true die and agree to say that at time n the
system is in state $E_j$ if j is the highest number appearing in the first n throws.
Find the matrix $P^n$ and verify that formula (8.3) holds.

4. In example (2.l) find the (absorption) probabilities $x_k$ and $y_k$ that,
starting from $E_k$, the system will end in $E_1$ or $E_5$, respectively (k = 2, 3, 4, 6).
(Do this problem from the basic definitions without referring to the formulas
of section 8.)

5. Treat example I(5.b) as a Markov chain. Calculate the probability of
winning for each player.

6. The first row of P is $\{p_1, p_2, \ldots\}$. In the following rows $p_{j,j-1} = 1$, all
other entries being zero. Discuss the character of the states and find the
stationary distribution, if any.

7. The first column of P is $\{q_0, q_1, \ldots\}$ and $p_{i,i+1} = 1 - q_i$ for i = 0, 1, ....
Prove that the states are transient if, and only if, $\sum q_i < \infty$. When are the
states null states? Find the stationary distribution, if any.

8. One reflecting barrier. Consider the random-walk matrix with $p_{k,k+1} = p$,
$p_{k,k-1} = q$ for k = 2, 3, ... and $p_{12} = p$, $p_{11} = q$. Prove that the states are
transient if p > q, persistent null states if p = q, and ergodic if p < q. Find
the stationary distribution.


21 C. Derman, A solution to a set of fundamental equations in Markov chains,
Proceedings American Mathematical Society, vol. 5 (1954), pp. 332-334.

22 W. Feller, Boundaries induced by positive matrices, Transactions American
Mathematical Society, vol. 83 (1956), pp. 19-54.

23 T. E. Harris, On chains of infinite order, Pacific Journal of Mathematics, vol.
5 (1955), Supplement 1, pp. 707-724.


9. Two reflecting barriers. A chain with states 1, 2, ..., a has a matrix
whose first and last rows are (q, p, 0, ..., 0) and (0, ..., 0, q, p). In all other
rows $p_{k,k+1} = p$, $p_{k,k-1} = q$. Find the stationary distribution. Can the chain
be periodic?

10. N black and N white balls are placed in two urns so that each urn
contains N balls. The number of black balls in the first urn is the state of the
system. At each step one ball is selected at random from each urn, and the
two balls thus selected are interchanged. Find the $p_{jk}$. Show that in the
limiting distribution the term $u_k$ equals the probability of getting exactly k
black balls if N balls are selected at random out of a collection of N black and
N white balls.24


11. A chain with states $E_0, E_1, \ldots$ has transition probabilities

$$
p_{jk} = e^{-\lambda} \sum_{\nu=0}^{j} \binom{j}{\nu} p^{\nu} q^{j-\nu} \frac{\lambda^{k-\nu}}{(k-\nu)!}
$$

where the terms in the sum should be replaced by zero if $\nu > k$. Show that

$$
p_{jk}^{(n)} \to e^{-\lambda/q} \frac{(\lambda/q)^k}{k!}.
$$

Note: This chain occurs in statistical mechanics 25 and can be interpreted as
follows. The state of the system is defined by the number of particles in a
certain volume of space. During each time interval of unit length each particle
has probability q to leave the volume, and the particles are stochastically
independent. Moreover, new particles may enter the volume, and the probability
of r entrants is given by the Poisson expression $e^{-\lambda} \lambda^r / r!$. The stationary
distribution is then a Poisson distribution with parameter $\lambda/q$.


12. Ehrenfest model. In example (2.f) let there initially be j molecules in
the first container, and let $X^{(n)} = 2k - a$ if at time n the system is in state
k (so that $X^{(n)}$ is the difference of the number of molecules in the two
containers). Let $e_n = E(X^{(n)})$. Prove that $e_{n+1} = (a - 2)e_n/a$, whence
$e_n = (1 - 2/a)^n (2j - a)$. (Note that $e_n \to 0$ as $n \to \infty$.)

13. Treat the counter problem, example XIII(1.6), as a Markov chain. 


14. Plane random walk with reflecting barriers. Consider a symmetric ran- 
dom walk in a bounded region of the plane. The boundary is reflecting in 
the sense that, whenever in an unrestricted random walk the particle would 
leave the region, it is forced to return to the last position. Show that, if every 
point of the region can be reached from every other point, there exists a
stationary distribution and that $u_k = 1/a$, where a is the number of positions in
the region.


15. Repeated averaging. Let $\{x_1, x_2, \ldots\}$ be a bounded sequence of numbers
and P the matrix of an ergodic chain. Prove that
$\sum_j p_{ij}^{(n)} x_j \to \sum_j u_j x_j$.
Show that the repeated averaging procedure of chapter XIII, section 4, and
of problem XIII, 5 is a special case.


24 This problem goes back to Laplace; see Fréchet’s book (cited in footnote 16),
p. 49.

25 S. Chandrasekhar, Stochastic problems in physics and astronomy, Reviews of
Modern Physics, vol. 15 (1943), pp. 1-89, in particular p. 45.

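16. In the theory of waiting lines we encounter the chain matrix

$$
\begin{pmatrix}
p_0 & p_1 & p_2 & p_3 & \cdots \\
p_0 & p_1 & p_2 & p_3 & \cdots \\
0 & p_0 & p_1 & p_2 & \cdots \\
0 & 0 & p_0 & p_1 & \cdots \\
\cdots & \cdots & \cdots & \cdots & \cdots
\end{pmatrix}
$$

where $\{p_k\}$ is a probability distribution. Using generating functions, discuss
the character of the states. Find the generating function of the stationary
distribution, if any.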

17. Waiting time to absorption. For transient $E_j$ let $Y_j$ be the time when
the system for the first time passes into a persistent state. Assuming that the
probability of staying forever in transient states is zero, prove that $d_j = E(Y_j)$
is uniquely determined as the solution of the system of linear equations

$$
d_j - \sum_{\nu} p_{j\nu} d_\nu = 1,
$$

the summation extending over all $\nu$ such that $E_\nu$ is transient. However, $d_j$
need not be finite.

18. If the number of states is $a < \infty$ and if $E_k$ can be reached from $E_j$,
then it can be reached in a steps or less.

19. Let the chain contain a states and let $E_j$ be persistent. There exists
a number q < 1 such that for $n \ge a$ the probability of the recurrence time of
$E_j$ exceeding n is smaller than $q^n$. (Hint: Use problem 18.)

20. In a finite chain $E_j$ is transient if and only if there exists an $E_k$ such that
$E_k$ can be reached from $E_j$ but not $E_j$ from $E_k$. (For infinite chains this is
false, as shown by problem 7.)

21. An irreducible chain for which one diagonal element $p_{jj}$ is positive cannot
be periodic.

22. A finite irreducible chain is non-periodic if and only if there exists an n
such that $p_{jk}^{(n)} > 0$ for all j and k.

23. In a chain with $a$ states let $(x_1, \ldots, x_a)$ be a solution of the
system of linear equations $x_j = \sum_\nu p_{j\nu} x_\nu$. Prove: (a) the states
$E_r$ for which $x_r > 0$ form a closed (not necessarily irreducible) set;
(b) if $E_j$ and $E_k$ belong to the same irreducible set, then $x_j = x_k$.

24. Continuation. If $(x_1, \ldots, x_a)$ is a solution of
$x_j = s \sum_\nu p_{j\nu} x_\nu$ with $|s| = 1$ but $s \ne 1$, then there exists
an integer $t > 1$ such that $s^t = 1$. If the chain is irreducible, then the
smallest integer of this kind is the period of the chain.


25. Mean ergodic theorem.²⁶ In an arbitrary chain let

$$A_{jk}^{(n)} = \frac{1}{n} \sum_{\nu=1}^{n} p_{jk}^{(\nu)}.$$

If $E_j$ and $E_k$ belong to the same irreducible closed set, then $A_{jk}^{(n)}$
tends to a limit which is independent of $j$ and equals the stationary
probability $u_k$, whenever the latter exists. If $E_j$ and $E_k$ belong to
different closed sets, then $A_{jk}^{(n)} = 0$ for all $n$. If $E_k$ is transient,
then $A_{jk}^{(n)} \to 0$ for all $j$.

26 This theorem is a simple consequence of the results of the present chapter.
However, it is much weaker and can therefore be proved by simpler methods; see
K. Yosida and S. Kakutani, Markoff processes with an enumerable infinite number
of possible states, Japanese Journal of Mathematics, vol. 16 (1939), pp. 47-55.

26. Moving averages. Let $\{Y_k\}$ be a sequence of mutually independent random
variables, each assuming the values $\pm 1$ with probability $\tfrac12$. Put
$X^{(n)} = (Y_n + Y_{n+1})/2$. Find the transition probabilities

$$p_{jk}(m, n) = P\{X^{(n)} = k \mid X^{(m)} = j\},$$

where $m < n$ and $j, k = -1, 0, 1$. Conclude that $\{X^{(n)}\}$ is not a Markov
process and that (10.3) does not hold.


27. In a sequence of Bernoulli trials say that the state $E_1$ is observed at
time $n$ if the trials number $n - 1$ and $n$ resulted in success; otherwise the
system is in $E_2$. Find the $n$-step transition probabilities and discuss the
non-Markovian character.

Note: This process is obtained from the chain of problem 1 by lumping 
together three states. Such a grouping procedure can be applied to any Markov 
chain and destroys the Markovian character. Processes of this type are studied 
in the paper by Harris. 

28. Mixing of Markov chains. Given two Markov chains with the same number of
states, and matrices $P_1$ and $P_2$. A new process is defined by an initial
distribution and $n$-step transition probabilities
$\tfrac12 P_1^n + \tfrac12 P_2^n$. Discuss the non-Markovian character and the
relation to the urn models of chapter V.


CHAPTER XVI* 


Algebraic Treatment 


of Finite Markov Chains 


In this chapter we consider a Markov chain with finitely many states
$E_1, \ldots, E_a$ and a given matrix of transition probabilities $p_{jk}$. Our
main aim is to derive explicit formulas for the $n$-step transition
probabilities $p_{jk}^{(n)}$. We shall not require the results of the preceding
chapter, except the general concepts and notations of section 3.

We shall make use of the method of generating functions and shall 
obtain the desired results from the partial fraction expansions of chap- 
ter XI, section 4. Our results can also be obtained directly from the 
theory of canonical decompositions of matrices¹ (which in turn can be
derived from our results). Moreover, for finite chains the ergodic 
properties proved in chapter XV follow from the results of the present 
chapter. However, for simplicity, we shall slightly restrict the gen- 
erality and disregard exceptional cases which complicate the general 
theory and do not occur in practical examples. 

The general method is outlined in section 1 and illustrated in sec- 
tions 2 and 3. In section 4 special attention is paid to transient states 
and absorption probabilities. In section 5 the theory is applied to finding
the variances of the recurrence times of the states $E_j$.


* This chapter treats a special topic and may be omitted.
1 See the treatise by Fréchet cited in chapter XV, section 11.


1. GENERAL THEORY
For every fixed $j$, $k$ we define a generating function

(1.1)    $P_{jk}(s) = \sum_{n=1}^{\infty} p_{jk}^{(n)} s^n.$

Multiplying this equation by $s p_{ij}$ and adding over all $j$, we get

(1.2)    $s \sum_{j=1}^{a} p_{ij} P_{jk}(s) = P_{ik}(s) - s p_{ik}.$


For every fixed $k$ we have here a system of $a$ non-homogeneous linear
equations for the $a$ unknowns $P_{1k}(s), \ldots, P_{ak}(s)$. Theoretically, this
system can be solved by means of determinants or by successive eliminations of
unknowns. We use only the fact that the determinant $D(s)$ of the system is a
polynomial of degree not exceeding $a$, and that the $P_{jk}(s)$ are rational
functions of $s$ with the common denominator $D(s)$. We shall consider only the
case where the equation $D(s) = 0$ has no multiple roots; this is a slight
restriction of generality, but the theory will cover most cases of practical
interest.

Since the $P_{jk}(s)$ are rational functions, the partial fraction expansion
of chapter XI, section 4, shows that there exist coefficients
$\rho_{\nu k}^{(1)}, \ldots, \rho_{\nu k}^{(a)}$ such that

(1.3)    $p_{\nu k}^{(n)} = \frac{\rho_{\nu k}^{(1)}}{s_1^n} + \frac{\rho_{\nu k}^{(2)}}{s_2^n} + \cdots + \frac{\rho_{\nu k}^{(a)}}{s_a^n},$

where $s_1, s_2, \ldots$ are the roots of $D(s) = 0$. If the degree of $D(s)$ is
smaller than $a$, then (1.3) will contain fewer than $a$ terms. It is also
possible that for some particular values of $\nu$ and $k$ one or more roots $s_r$
are common to the numerator and denominator and hence cancel. We take care of
such cases by letting the corresponding $\rho_{\nu k}^{(r)}$ be zero.

We could calculate the roots $s_r$ and the coefficients $\rho_{\nu k}^{(r)}$ by
the methods of chapter XI, but it is preferable to take advantage of certain
particular properties of Markov chains. Multiply equation (1.3) by $p_{j\nu}$
and sum over $\nu = 1, 2, \ldots, a$. The result is

(1.4)    $p_{jk}^{(n+1)} = \sum_{\nu=1}^{a} p_{j\nu} \left\{ \frac{\rho_{\nu k}^{(1)}}{s_1^n} + \cdots + \frac{\rho_{\nu k}^{(a)}}{s_a^n} \right\}.$

If the left side is expressed by means of (1.3), we get an identity which can
hold for all $n$ only if the coefficients of $s_1^{-n}, \ldots, s_a^{-n}$ on both
sides are equal. This means that for every fixed $r$ we must have

(1.5a)    $\rho_{jk}^{(r)} = s_r \sum_{\nu=1}^{a} p_{j\nu} \rho_{\nu k}^{(r)}, \qquad j = 1, \ldots, a.$


In like manner we get, on multiplying (1.3) by $p_{km}$ and adding over all $k$,

(1.5b)    $\rho_{\nu m}^{(r)} = s_r \sum_{k=1}^{a} \rho_{\nu k}^{(r)} p_{km}.$

The relations (1.5a) show that for $k$ and $r$ fixed the $a$ quantities
$\rho_{1k}^{(r)}, \ldots, \rho_{ak}^{(r)}$ are a solution of the system of $a$
linear equations

(1.6a)    $x_j^{(r)} = s_r \sum_{\nu=1}^{a} p_{j\nu} x_\nu^{(r)} \qquad (j = 1, \ldots, a).$

Similarly, relations (1.5b) imply that for $\nu$ and $r$ fixed, the
$\rho_{\nu 1}^{(r)}, \ldots, \rho_{\nu a}^{(r)}$ satisfy the $a$ linear equations

(1.6b)    $y_m^{(r)} = s_r \sum_{k=1}^{a} y_k^{(r)} p_{km} \qquad (m = 1, \ldots, a).$

For a better understanding let us replace $s_r$ by an arbitrary $s$ and study
the two more general systems

(1.7a)    $x_j = s \sum_{\nu=1}^{a} p_{j\nu} x_\nu \qquad (j = 1, \ldots, a)$

and

(1.7b)    $y_m = s \sum_{k=1}^{a} y_k p_{km} \qquad (m = 1, \ldots, a).$


A system of $a$ homogeneous equations in $a$ unknowns can have a non-trivial²
solution only if its determinant vanishes. Now the matrices of the two systems
(1.7a) and (1.7b) are the same except that rows and columns are interchanged.
Their determinants are therefore equal. Moreover, the determinant of (1.7a)
obviously equals the determinant of the system (1.2), which means that the
determinants of the two systems (1.7a) and (1.7b) vanish for
$s = s_1, s_2, \ldots, s_a$.

2 As usual we call an identically vanishing solution trivial and disregard it.

We can now forget about the generating functions $P_{jk}(s)$ and define the
roots $s_r$ as those numbers (real or complex) for which the systems (1.7a) and
(1.7b) admit of non-trivial solutions. The assumption that $s_r$ is a simple
root means that for every fixed $r$ the solutions
$(x_1^{(r)}, \ldots, x_a^{(r)})$ and $(y_1^{(r)}, \ldots, y_a^{(r)})$ are uniquely
determined except, of course, for a numerical factor. However, our starting
point was the discovery that, for $k$ and $r$ fixed,
$(\rho_{1k}^{(r)}, \ldots, \rho_{ak}^{(r)})$ is a solution of (1.7a), while for
$\nu$ and $r$ fixed $(\rho_{\nu 1}^{(r)}, \ldots, \rho_{\nu a}^{(r)})$ is a solution
of (1.7b). Since these solutions are determined up to a numerical factor, we
must have

(1.8)    $\rho_{jk}^{(r)} = c_r x_j^{(r)} y_k^{(r)}.$

There remains only the calculation of the constants $c_1, \ldots, c_a$.

From (1.8) and (1.3) we have

(1.9)    $p_{jk}^{(n)} = \sum_{r=1}^{a} c_r x_j^{(r)} y_k^{(r)} s_r^{-n}.$

Therefore

(1.9a)    $P_{jk}(s) = c_1 \frac{x_j^{(1)} y_k^{(1)} s}{s_1 - s} + \cdots + c_a \frac{x_j^{(a)} y_k^{(a)} s}{s_a - s}.$

Using (1.6a) we get from (1.1)

(1.10)    $\sum_{k=1}^{a} P_{jk}(s) x_k^{(r)} = x_j^{(r)} \sum_{n=1}^{\infty} s_r^{-n} s^n = \frac{s x_j^{(r)}}{s_r - s}.$

On the other hand, if we evaluate the left side using (1.9a), we are led to a
sum of $a$ fractions with denominators $s_\nu - s$. It follows that for
$\nu \ne r$ the numerators must vanish and

(1.11)    $1 = c_r \sum_{\nu=1}^{a} x_\nu^{(r)} y_\nu^{(r)},$

and thus we have found $c_r$. It is true that the solutions $x_j^{(r)}$ and
$y_k^{(r)}$ are determined only up to a numerical factor. However, if we replace
the $x_j^{(r)}$ by $A x_j^{(r)}$, and the $y_k^{(r)}$ by $B y_k^{(r)}$, then $c_r$
will be changed into $c_r/AB$ and the quantity $\rho_{jk}^{(r)}$ of (1.8) remains
unchanged.
Summarizing, we have the following procedure to calculate $p_{jk}^{(n)}$.


Write down the two systems of linear equations (1.7a) and (1.7b). They have a
common determinant and admit of non-trivial solutions only for values of $s$ for
which this determinant vanishes. We suppose that the roots $s_1, s_2, \ldots$ (of
which there are at most $a$) are simple; then for each $r$, the solutions
$(x_1^{(r)}, \ldots, x_a^{(r)})$ and $(y_1^{(r)}, \ldots, y_a^{(r)})$ are determined
up to an arbitrary multiplicative constant. Find these solutions and the
constants $c_r$ from (1.11). Then $p_{jk}^{(n)}$ is given by (1.9).

For every fixed $r$ the $\rho_{jk}^{(r)}$ form a matrix which may be constructed
in the following way. Form a multiplication table with the $x_j^{(r)}$ heading
the rows and the $y_k^{(r)}$ heading the columns. Multiplying all $a^2$ elements
of this square table by $c_r$, we get the matrix $(\rho_{jk}^{(r)})$. To construct
the matrix $(p_{jk}^{(n)})$ we have to divide all elements of $(\rho_{jk}^{(r)})$ by
$s_r^n$ and add the matrices thus obtained for $r = 1, 2, \ldots, a$. Note that
the roots $s_r$ may be simple even if there are fewer than $a$ roots.
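
The procedure can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the roots and the solutions of (1.7a) and
(1.7b) are supplied by the caller (found by hand, as in the text), and the
three-state matrix, the data, and the names `spectral_power` and `mat_pow` are
hypothetical choices that do not come from the text. For this particular matrix
the determinant vanishes only for s = 1 and s = 2, so that there are two roots
although there are three states.

```python
from fractions import Fraction as F


def spectral_power(roots, xs, ys, n):
    """Evaluate (1.9) for n >= 1; roots[r], xs[r], ys[r] belong together."""
    a = len(xs[0])
    out = [[F(0)] * a for _ in range(a)]
    for s, x, y in zip(roots, xs, ys):
        c = 1 / sum(xv * yv for xv, yv in zip(x, y))    # c_r from (1.11)
        for j in range(a):
            for k in range(a):
                out[j][k] += c * x[j] * y[k] / s ** n   # one term of (1.9)
    return out


def mat_pow(P, n):
    """n-th power of P by repeated multiplication, used for comparison."""
    a = len(P)
    R = [[F(int(i == j)) for j in range(a)] for i in range(a)]
    for _ in range(n):
        R = [[sum(R[i][m] * P[m][j] for m in range(a)) for j in range(a)]
             for i in range(a)]
    return R


# Hypothetical three-state chain (an assumption made for this sketch).
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0), F(1, 2), F(1, 2)]]
roots = [F(1), F(2)]
xs = [[F(1), F(1), F(1)], [F(1), F(0), F(-1)]]   # solutions of (1.7a)
ys = [[F(1), F(2), F(1)], [F(1), F(0), F(-1)]]   # solutions of (1.7b)

for n in (1, 2, 5):
    assert spectral_power(roots, xs, ys, n) == mat_pow(P, n)
```

Exact fractions are used so that the comparison with the directly computed
powers is an equality and not merely an agreement up to rounding.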

The case of multiple roots requires certain changes but may be treated by
similar methods. The case of greatest interest will be discussed in section 4.

In algebra the reciprocals $\lambda_r = 1/s_r$ are called characteristic values
(or eigen values or latent roots) of the matrix $P$. Zero is a possible
characteristic value, but to it there corresponds no root $s_r$. This explains
why there may be fewer than $a$ roots $s_r$ even though there are always $a$
characteristic values. The use of $s_r$ rather than of their reciprocals is more
convenient for the method of generating functions. Moreover, it corresponds to
the general usage in the theory of integral equations and is therefore more
natural in probability theory.

The value $s = 1$ always occurs among the $s_r$, and to it there corresponds the
solution $(1, 1, \ldots, 1)$ of (1.7a). For all $r$ we have $|s_r| \ge 1$. In
fact,³ a root $s_r$ with $|s_r| < 1$ would lead to a divergent development in
(1.3). If $s_1 = 1$ is the only root with $|s_r| = 1$, then
$p_{jk}^{(n)} \to c_1 x_j^{(1)} y_k^{(1)}$. It is not difficult to show that if
there exist other roots with $|s_r| = 1$, then they are necessarily $t$th roots
of unity, where $t$ is an integer; in this case the chain has period $t$. For
details the reader is referred to Fréchet's treatise quoted in chapter XV,
section 11.

Often it is cumbersome or impossible to find all roots $s_r$. However, it is
clear that the asymptotic behavior of $p_{jk}^{(n)}$ is determined in first
approximation by the $s_r$ with $|s_r| = 1$, and in second approximation by the
roots $s_r$ with the next smallest absolute value.

The final formula (1.9) can be written more elegantly in matrix notation. Let
$X^{(r)}$ be the column vector (or an $a \times 1$ matrix) with elements
$x_j^{(r)}$ and let $Y^{(r)}$ be the row vector (or a $1 \times a$ matrix) with
elements $y_k^{(r)}$. Then $X^{(r)} Y^{(r)}$ is the $a \times a$ matrix with
elements $x_j^{(r)} y_k^{(r)}$ and (1.9) takes on the form

(1.12)    $P^n = \sum_{r=1}^{a} c_r s_r^{-n} X^{(r)} Y^{(r)}$, where $c_r^{-1} = Y^{(r)} X^{(r)}$.

The vectors $X^{(r)}$ and $Y^{(r)}$ are called latent vectors or eigen vectors,
and $c_r^{-1}$ is their inner product.


3 A direct proof is as follows. Let $M$ be the largest term in the sequence
$|x_1^{(r)}|, \ldots, |x_a^{(r)}|$ ($r$ fixed). Then from (1.6a)
$M \le |s_r| \sum_\nu p_{j\nu} M = |s_r| M$, or $|s_r| \ge 1$.


2. EXAMPLES


(a) Consider first a chain with only two states. The matrix of transition
probabilities assumes the simple form

$$P = \begin{pmatrix} 1 - p & p \\ \alpha & 1 - \alpha \end{pmatrix}$$

where $0 < p < 1$ and $0 < \alpha < 1$. The equations (1.7a) reduce to
$s(1 - p)x_1 + s p x_2 = x_1$ and $s\alpha x_1 + s(1 - \alpha)x_2 = x_2$. Equating
the two ratios $x_1/x_2$, it is found that a solution exists only if either
$s = 1$ or $s = 1/(1 - \alpha - p)$. The solution corresponding to $s_1 = 1$ is
$(1, 1)$; the solution corresponding to $s_2 = 1/(1 - \alpha - p)$ is
$(p, -\alpha)$. Next take the system (1.7b) which now reduces to
$s(1 - p)y_1 + s\alpha y_2 = y_1$ and $s p y_1 + s(1 - \alpha)y_2 = y_2$. We know
that it can be solved only when $s = s_1$ or $s = s_2$. The corresponding
solutions are $(\alpha, p)$ and $(1, -1)$. From (1.11) we get
$c_1 = c_2 = 1/(\alpha + p)$. Equations (1.9) and (1.11) now enable us to write
down explicit formulas for the quantities $p_{jk}^{(n)}$. The final result can be
written in matrix form

$$P^n = \frac{1}{\alpha + p} \begin{pmatrix} \alpha & p \\ \alpha & p \end{pmatrix} + \frac{(1 - \alpha - p)^n}{\alpha + p} \begin{pmatrix} p & -p \\ -\alpha & \alpha \end{pmatrix}$$

(where factors common to all four elements have been taken out as factors to
the matrices). Since $|1 - \alpha - p| < 1$, the second matrix tends to zero as
$n \to \infty$, and the first matrix represents the limiting form of $P^n$.
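
The two-term formula for the two-state chain is easily checked against direct
multiplication. The following Python sketch does this with exact fractions; the
function names and the numerical values of p and α are illustrative assumptions
and are not part of the text.

```python
from fractions import Fraction as F


def two_state_power(p, alpha, n):
    """P^n for P = [[1-p, p], [alpha, 1-alpha]] from the two-term formula."""
    t = (1 - alpha - p) ** n         # s_2^(-n)
    d = alpha + p
    return [[(alpha + p * t) / d, (p - p * t) / d],
            [(alpha - alpha * t) / d, (p + alpha * t) / d]]


def direct_power(p, alpha, n):
    """The same matrix obtained by n successive multiplications."""
    P = [[1 - p, p], [alpha, 1 - alpha]]
    R = [[F(1), F(0)], [F(0), F(1)]]
    for _ in range(n):
        R = [[sum(R[i][m] * P[m][j] for m in range(2)) for j in range(2)]
             for i in range(2)]
    return R


p, alpha = F(1, 3), F(1, 5)          # illustrative values only
for n in range(1, 8):
    assert two_state_power(p, alpha, n) == direct_power(p, alpha, n)
```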


(b) Let

(2.1)    $P = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ \tfrac12 & \tfrac12 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$

[this is the matrix of problem XV, 2(b)]. The system (1.7a) reduces to

(2.2)    $x_1 = s x_4, \quad x_2 = s x_4, \quad x_3 = \frac{s(x_1 + x_2)}{2}, \quad x_4 = s x_3.$

Since a multiplicative constant remains arbitrary, we may put $x_4 = 1$. Then
$x_1 = s$, $x_2 = s$, $x_3 = s^2$, $x_4 = s^3$, and therefore we must have
$s^3 = 1$. Now if we put

(2.3)    $\theta = e^{2\pi i/3} = \cos \frac{2\pi}{3} + i \sin \frac{2\pi}{3},$

then the three roots of $s^3 = 1$ are $s_1 = 1$, $s_2 = \theta$, $s_3 = \theta^2$.
(Note that we have only three roots, even though there are four states.) The
solutions $x_j^{(r)}$ corresponding to the three roots are $(1, 1, 1, 1)$,
$(\theta, \theta, \theta^2, 1)$, $(\theta^2, \theta^2, \theta, 1)$.

From system (1.7b) we get $y_1 = s y_3/2$, $y_2 = s y_3/2$, $y_3 = s y_4$,
$y_4 = s(y_1 + y_2)$. The three sets of solutions corresponding to $s_1 = 1$,
$s_2 = \theta$, and $s_3 = \theta^2$ are $(1, 1, 2, 2)$,
$(\theta, \theta, 2, 2\theta^2)$, $(\theta^2, \theta^2, 2, 2\theta)$. Therefore
from (1.11) $c_1 = \tfrac16$, $c_2 = 1/(6\theta^2) = \theta/6$,
$c_3 = 1/(6\theta) = \theta^2/6$. We are now able to express all $p_{jk}^{(n)}$.
For example,

(2.4)    $p_{11}^{(n)} = p_{22}^{(n)} = \frac{1 + \theta^n + \theta^{2n}}{6}, \quad p_{13}^{(n)} = \frac{1 + \theta^{2n+2} + \theta^{n+1}}{3}, \quad p_{14}^{(n)} = \frac{1 + \theta^{2n+1} + \theta^{n+2}}{3},$

etc. The chain is obviously periodic with period 3.
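
The periodic behavior can be seen numerically. The sketch below, written under
the assumption that (2.1) and (2.4) read as stated in the comments, evaluates the
three expressions with complex arithmetic and compares them with the first row
of the directly computed powers; the function names are illustrative only.

```python
import cmath

theta = cmath.exp(2j * cmath.pi / 3)      # the cube root of unity of (2.3)


def p11(n):
    return (1 + theta ** n + theta ** (2 * n)) / 6


def p13(n):
    return (1 + theta ** (2 * n + 2) + theta ** (n + 1)) / 3


def p14(n):
    return (1 + theta ** (2 * n + 1) + theta ** (n + 2)) / 3


def mat_pow(P, n):
    """n-th power of P by repeated multiplication."""
    a = len(P)
    R = [[float(i == j) for j in range(a)] for i in range(a)]
    for _ in range(n):
        R = [[sum(R[i][m] * P[m][j] for m in range(a)) for j in range(a)]
             for i in range(a)]
    return R


# Matrix (2.1): E1 and E2 lead to E4, E4 leads to E3, E3 to E1 or E2.
P = [[0, 0, 0, 1], [0, 0, 0, 1], [0.5, 0.5, 0, 0], [0, 0, 1, 0]]
for n in range(1, 10):
    row = mat_pow(P, n)[0]
    assert abs(p11(n) - row[0]) < 1e-9
    assert abs(p13(n) - row[2]) < 1e-9
    assert abs(p14(n) - row[3]) < 1e-9
```

The imaginary parts cancel up to rounding, and each expression vanishes except
for every third value of n.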
(c) Let $p + q = 1$, and

(2.5)    $P = \begin{pmatrix} 0 & p & 0 & q \\ q & 0 & p & 0 \\ 0 & q & 0 & p \\ p & 0 & q & 0 \end{pmatrix}.$

[This matrix describes a cyclical random walk; see example XV(2.d).] The
equations (1.7a) reduce to $x_1 = s(p x_2 + q x_4)$, $x_2 = s(q x_1 + p x_3)$,
$x_3 = s(q x_2 + p x_4)$, $x_4 = s(p x_1 + q x_3)$. Suppose that $p \ne q$. From
the first and the third equations we find $x_1 + x_3 = s(x_2 + x_4)$, and from
the remaining equations $x_2 + x_4 = s(x_1 + x_3)$. Hence we have either
$s^2 = 1$ or $x_1 + x_3 = x_2 + x_4 = 0$. The first alternative leads to the two
roots $s_1 = 1$, $s_2 = -1$. On the other hand, substituting $x_3 = -x_1$,
$x_4 = -x_2$ into the first two equations, we find $s^2 (p - q)^2 = -1$, which
yields the remaining two roots $s_3$ and $s_4$. Thus

(2.6)    $s_1 = 1, \quad s_2 = -1, \quad s_3 = \frac{i}{q - p}, \quad s_4 = -\frac{i}{q - p}$

(where $i^2 = -1$). The corresponding solutions $x_j^{(r)}$ contain an arbitrary
factor, and we are free to put $x_4^{(r)} = 1$. Then the four sets of solutions
are easily found to be $(1, 1, 1, 1)$, $(-1, 1, -1, 1)$, $(i, -1, -i, 1)$,
$(-i, -1, i, 1)$. The system (1.7b) reduces in our case to
$y_1 = s(q y_2 + p y_4)$, $y_2 = s(p y_1 + q y_3)$, $y_3 = s(p y_2 + q y_4)$,
$y_4 = s(q y_1 + p y_3)$. To the four roots (2.6) there correspond the solutions
$(1, 1, 1, 1)$, $(-1, 1, -1, 1)$, $(-i, -1, i, 1)$, $(i, -1, -i, 1)$. For the
constants $c_r$ we find from (1.11) $c_1 = c_2 = c_3 = c_4 = \tfrac14$. Using
(1.3) and (1.8), we can now write an explicit formula for each sequence
$p_{jk}^{(n)}$ ($n = 1, 2, 3, \ldots$). In the present case the solutions
$x_j^{(r)}$ and $y_k^{(r)}$ are of the simple form
$(\alpha, \alpha^2, \alpha^3, \alpha^4)$, where $\alpha$ is one of the four
numbers $1$, $-1$, $i$, or $-i$. This enables us to express the $p_{jk}^{(n)}$ by
the single formula

(2.7)    $p_{jk}^{(n)} = \tfrac14 \{1 + (q - p)^n i^{j-k-n}\}\{1 + (-1)^{j+k+n}\}.$

This formula is valid also for $p = q = \tfrac12$.

It is seen that the term involving $(q - p)^n$ tends to zero, and that the other
term has period 2.

(d) General cyclical random walk [example XV(2.d)]. In the preceding example we
were able to express the $x_j^{(r)}$ and $y_k^{(r)}$ as powers of the four fourth
roots of unity. This suggests trying a similar procedure for the general matrix
of example XV(2.d). It is convenient to number the states from 0 to $a - 1$. For
brevity we put

(2.8)    $\theta = e^{2\pi i/a}.$

This is an $a$th root of unity, and all $a$th roots are represented by the
sequence $1, \theta, \theta^2, \ldots, \theta^{a-1}$. It is easily seen that the
systems (1.7a) and (1.7b) are satisfied by the $a$ sets of solutions

(2.9)    $x_j^{(r)} = \theta^{rj}, \qquad y_k^{(r)} = \theta^{-rk}$

with $r = 0, 1, 2, \ldots, a-1$; they correspond to

(2.10)    $s_r = \left\{ \sum_{\nu=0}^{a-1} q_\nu \theta^{\nu r} \right\}^{-1}.$

From equations (1.11) and (2.9) we find $c_r = 1/a$ for all $r$, and thus
finally

(2.11)    $p_{jk}^{(n)} = \frac{1}{a} \sum_{r=0}^{a-1} \theta^{r(j-k)} \left( \sum_{\nu=0}^{a-1} q_\nu \theta^{\nu r} \right)^{n}.$

It is interesting to verify this formula for $n = 1$. Apart from the factor
$1/a$, the factor of $q_\nu$ is

(2.12)    $\sum_{r=0}^{a-1} \theta^{r(j-k+\nu)}.$

This sum is zero except when $j - k + \nu = 0$ or $a$, in which case each term
equals one. Hence $p_{jk}^{(1)}$ reduces to $q_{k-j}$ if $k \ge j$ and to
$q_{a+k-j}$ if $k < j$, and this is the given matrix $(p_{jk})$.
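
Formulas (2.7) and (2.11) lend themselves to a numerical check. The Python
sketch below assumes that the matrix of the cyclical random walk has the elements
p_jk = q_(k-j), the index being reduced modulo a, as in the verification for
n = 1 above; the function names and the numerical values are illustrative
assumptions. The four-state walk of example (c) is recovered with q_1 = p,
q_3 = q and q_0 = q_2 = 0.

```python
import cmath


def cyclic_entry(q, j, k, n):
    """(2.11): n-step probability from state j to state k (states 0..a-1)."""
    a = len(q)
    theta = cmath.exp(2j * cmath.pi / a)                  # (2.8)
    total = 0
    for r in range(a):
        inv_root = sum(q[v] * theta ** (v * r) for v in range(a))   # 1/s_r
        total += theta ** (r * (j - k)) * inv_root ** n
    return (total / a).real


def four_state_entry(p, j, k, n):
    """(2.7) for the walk of example (c), states numbered 1..4."""
    q = 1 - p
    i_power = (1, 1j, -1, -1j)[(j - k - n) % 4]           # i^(j-k-n)
    value = (1 + (q - p) ** n * i_power) * (1 + (-1) ** (j + k + n)) / 4
    return complex(value).real


def circulant_power(q, n):
    """Direct n-th power of the matrix with elements q[(k - j) mod a]."""
    a = len(q)
    P = [[q[(k - j) % a] for k in range(a)] for j in range(a)]
    R = [[float(j == k) for k in range(a)] for j in range(a)]
    for _ in range(n):
        R = [[sum(R[j][m] * P[m][k] for m in range(a)) for k in range(a)]
             for j in range(a)]
    return R


p = 0.3                                    # illustrative value only
steps = [0.0, p, 0.0, 1 - p]               # q_1 = p, q_3 = q as in (2.5)
for n in range(1, 7):
    R = circulant_power(steps, n)
    for j in range(4):
        for k in range(4):
            assert abs(cyclic_entry(steps, j, k, n) - R[j][k]) < 1e-9
            assert abs(four_state_entry(p, j + 1, k + 1, n) - R[j][k]) < 1e-9
```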

(e) The occupancy problem. Example XV(2.g) shows that the classical occupancy
problem can be treated by the method of Markov chains. The system is in state
$j$ if there are $j$ occupied and $a - j$ empty cells. If this is the initial
situation and $n$ additional balls are placed at random, then $p_{jk}^{(n)}$ is
the probability that there will be $k$ occupied and $a - k$ empty cells (so that
$p_{jk}^{(n)} = 0$ if $k < j$). For $j = 0$ this probability follows from formula
II(11.7). We now derive a formula for $p_{jk}^{(n)}$, thus generalizing the
result of chapter II.

Since $p_{jj} = j/a$ and $p_{j,j+1} = (a - j)/a$, it is easily seen that the
system of equations (1.7a) reduces to

(2.13)    $(a - sj) x_j = s(a - j) x_{j+1}, \qquad j = 0, \ldots, a.$

For $s = 1$ we get the solution $x_j = 1$. It is clear that if $s \ne 1$ then
$x_a = 0$, so that $s = 1$ is the only value of $s$ for which all $x_j$ are
different from zero. If $s$ is any other value for which (2.13) has a solution,
then there must exist some index $r < a$ such that $x_{r+1} = 0$ but
$x_r \ne 0$; from (2.13) it then follows that $sr = a$. Thus the roots $s_r$ for
which (2.13) has solutions are $s_r = a/r$ with $r = 1, 2, \ldots, a$. The
corresponding solutions of (2.13) are obtained successively, putting
$x_0^{(r)} = 1$, and $j = 0, 1, \ldots$. We find

(2.14)    $x_j^{(r)} = \binom{r}{j} \div \binom{a}{j},$

so that $x_j^{(r)} = 0$ when $j > r$.
For $s = s_r$ the system (1.7b) reduces to

(2.15)    $(r - j) y_j^{(r)} = (a - j + 1) y_{j-1}^{(r)}$

and has the solution

(2.16)    $y_j^{(r)} = \binom{a - r}{j - r} (-1)^{j-r},$

where, of course, $y_j^{(r)} = 0$ if $j < r$. Since $x_j^{(r)} = 0$ for $j > r$
and $y_j^{(r)} = 0$ for $j < r$, we find from equation (1.11) that
$c_r = \{x_r^{(r)} y_r^{(r)}\}^{-1} = \binom{a}{r}$, and hence

(2.17)    $p_{jk}^{(n)} = \sum_{r=j}^{k} \left( \frac{r}{a} \right)^{n} \binom{a}{r} \binom{r}{j} \binom{a}{j}^{-1} \binom{a - r}{k - r} (-1)^{k-r}.$

On expressing the binomial coefficients in terms of factorials, this formula
simplifies to

(2.18)    $p_{jk}^{(n)} = \binom{a - j}{a - k} \sum_{\nu=0}^{k-j} \left( \frac{\nu + j}{a} \right)^{n} (-1)^{k-j-\nu} \binom{k - j}{\nu},$

with $p_{jk}^{(n)} = 0$ if $k < j$.
(Further examples are found in the following two sections.)
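
The occupancy formulas can be verified exactly for a small number of cells. The
following Python sketch is an illustration only: it takes the solutions to be
binom(r, j)/binom(a, j) and (-1)^(k-r) binom(a-r, k-r) with c_r = binom(a, r) and
s_r = a/r, sums the corresponding terms of (1.9) for r = j, ..., k, and compares
the result with direct matrix powers. The second function is the same sum after
the binomial coefficients have been combined, written with the summation index
ν = r - j. All function names are hypothetical.

```python
from fractions import Fraction as F
from math import comb


def occupancy(a, j, k, n):
    """Probability of passing from j to k occupied cells in n throws."""
    return sum(F(r, a) ** n * comb(a, r) * F(comb(r, j), comb(a, j))
               * comb(a - r, k - r) * (-1) ** (k - r)
               for r in range(j, k + 1))


def occupancy_simplified(a, j, k, n):
    """The same probability after combining the binomial coefficients."""
    return comb(a - j, a - k) * sum(
        F(v + j, a) ** n * (-1) ** (k - j - v) * comb(k - j, v)
        for v in range(k - j + 1))


def direct(a, n):
    """n-th power of the matrix with p_jj = j/a and p_{j,j+1} = (a-j)/a."""
    P = [[F(0)] * (a + 1) for _ in range(a + 1)]
    for j in range(a + 1):
        P[j][j] = F(j, a)
        if j < a:
            P[j][j + 1] = F(a - j, a)
    R = [[F(int(i == j)) for j in range(a + 1)] for i in range(a + 1)]
    for _ in range(n):
        R = [[sum(R[i][m] * P[m][j] for m in range(a + 1))
              for j in range(a + 1)] for i in range(a + 1)]
    return R


a = 5                                       # illustrative number of cells
for n in range(1, 7):
    R = direct(a, n)
    for j in range(a + 1):
        for k in range(a + 1):
            assert occupancy(a, j, k, n) == R[j][k]
            assert occupancy_simplified(a, j, k, n) == R[j][k]
```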


3. RANDOM WALK WITH REFLECTING BARRIERS 


The application of Markov chains will now be illustrated by a complete
discussion of a random walk with states $1, 2, \ldots, a$ and two reflecting
barriers. The rows number $2, 3, \ldots, a-1$ of the matrix $P$ are determined by
$p_{k,k+1} = p$ and $p_{k,k-1} = q$; the first and the last rows are defined by
$(q, p, 0, \ldots, 0)$ and $(0, \ldots, 0, q, p)$. The matrix of example XV(2.c)
reduces to this when $\delta = 1$. In the terminology of random walks,
$p_{jk}^{(n)}$ is the probability that the particle which starts from $x = j$ is
at time $n$ at $x = k$.

4 Part of what follows is a repetition of the theory of chapter XIV. Our
quadratic equation occurs there as (4.7); the quantities $\lambda_1(s)$ and
$\lambda_2(s)$ of the text were given in (4.8), and the general solution (3.3)
appears in chapter XIV as (4.9). The two methods are related, but in many cases
the computational details will differ radically.

The equations (1.7a) take on the form

(3.1)    $x_1 = s(q x_1 + p x_2),$
         $x_j = s(q x_{j-1} + p x_{j+1}) \qquad (j = 2, 3, \ldots, a-1),$
         $x_a = s(q x_{a-1} + p x_a).$

This system admits the solution $x_j = 1$ corresponding to the root $s = 1$. To
find all other solutions we apply the method of particular solutions (which we
have used for similar equations in chapter XIV, section 4). The middle equation
in (3.1) is satisfied by $x_j = \lambda^j$ provided that $\lambda$ is a root of
the quadratic equation $\lambda = qs + \lambda^2 ps$. The two roots of this
equation are

(3.2)    $\lambda_1(s) = \frac{1 + (1 - 4pqs^2)^{1/2}}{2ps}, \qquad \lambda_2(s) = \frac{1 - (1 - 4pqs^2)^{1/2}}{2ps},$

and the most general solution of the middle equation in (3.1) is therefore

(3.3)    $x_j = A(s)\lambda_1^j(s) + B(s)\lambda_2^j(s),$

where $A(s)$ and $B(s)$ are arbitrary. The first and the last equation in (3.1)
will be satisfied by (3.3) if and only if $x_0 = x_1$ and $x_a = x_{a+1}$. This
requires that $A(s)$ and $B(s)$ satisfy the conditions

(3.4)    $A(s)\{1 - \lambda_1(s)\} + B(s)\{1 - \lambda_2(s)\} = 0,$
         $A(s)\lambda_1^a(s)\{1 - \lambda_1(s)\} + B(s)\lambda_2^a(s)\{1 - \lambda_2(s)\} = 0.$

However, these two equations are compatible only if

(3.5)    $\lambda_1^a(s) = \lambda_2^a(s),$

and we have to determine the values of $s$ for which (3.5) is possible.

From the definition (3.2) we have $\lambda_1(s)\lambda_2(s) = q/p$, and (3.5)
implies that $\lambda_1(s)(p/q)^{1/2}$ and $\lambda_2(s)(p/q)^{1/2}$ must be
$(2a)$th roots of unity. These roots can be written in the form

(3.6)    $e^{i\pi r/a} = \cos \frac{\pi r}{a} + i \sin \frac{\pi r}{a},$

where $i^2 = -1$ and $r = 0, 1, 2, \ldots, 2a-1$. Thus all solutions of (3.5)
are among the roots of

$$\lambda_1(s) = \left( \frac{q}{p} \right)^{1/2} e^{i\pi r/a}, \qquad \lambda_2(s) = \left( \frac{q}{p} \right)^{1/2} e^{-i\pi r/a}.$$


To each value $r$ we can find a root $s_r$, namely

(3.7)    $s_r = \{2(pq)^{1/2} \cos \pi r/a\}^{-1}.$

The value $r = a$ must be disregarded, since for it $\lambda_1(s) = \lambda_2(s)$,
$A(s) = -B(s)$, so that it leads only to the trivial solution $x_j = 0$. To
$r = 0$ there corresponds the solution $x_j = 1$, which we have already
considered. To $r = 1, 2, \ldots, a-1$ there correspond $a - 1$ distinct
solutions; if we let $r = a+1, a+2, \ldots, 2a-1$, we get the same solutions
with $\lambda_1(s)$ and $\lambda_2(s)$ interchanged. Thus we have found $a$
distinct sets of solutions of (3.1), and we know that there can be no more.

For $s = s_r$ with $r = 1, 2, \ldots, a-1$ we get from (3.4)
$2iA(s) = 1 - \lambda_2(s)$ and $2iB(s) = -\{1 - \lambda_1(s)\}$. (Remember that
a multiplicative constant remains arbitrary.) Substituting into (3.3), we find
the $a - 1$ sets of solutions

(3.8)    $x_j^{(r)} = \left( \frac{q}{p} \right)^{j/2} \sin \frac{\pi r j}{a} - \left( \frac{q}{p} \right)^{(j+1)/2} \sin \frac{\pi r (j-1)}{a}$

($r = 1, 2, \ldots, a-1$). To this we add the solution previously found

(3.9)    $x_j^{(0)} = 1.$

It is easy to verify that (3.8) and (3.9) represent solutions of the given
system (3.1).

We have now to find solutions of the second system of linear equations. In the
present case (1.7b) takes on the form

(3.10)    $y_1 = sq(y_1 + y_2),$
          $y_k = s(p y_{k-1} + q y_{k+1}) \qquad (k = 2, \ldots, a-1),$
          $y_a = sp(y_{a-1} + y_a).$

The middle equation is the same as (3.1) with $p$ and $q$ interchanged, and its
general solution is therefore obtained from (3.3) simply by interchanging $p$
and $q$. The first and the last equations can be satisfied if $s = s_r$, and a
simple calculation shows that for $r = 1, 2, \ldots, a-1$ the solution of (3.10)
is

(3.11)    $y_k^{(r)} = \left( \frac{p}{q} \right)^{k/2} \sin \frac{\pi r k}{a} - \left( \frac{p}{q} \right)^{(k-1)/2} \sin \frac{\pi r (k-1)}{a}.$

For $s = 1$ we find similarly

(3.12)    $y_k^{(0)} = \left( \frac{p}{q} \right)^{k}.$


The next step consists in evaluating the coefficients $c_r$ in (1.11). The sum
simplifies if $\sin^2 \pi r j/a$ is expressed in terms of the cosine of the
double angle, and this in turn by means of complex exponentials. Then we have
only to sum finite geometric series and find easily

(3.13)    $c_r = \frac{2p}{a} \left\{ 1 - 2(pq)^{1/2} \cos \frac{\pi r}{a} \right\}^{-1} \qquad (r = 1, 2, \ldots, a-1).$

For $r = 0$ we get

(3.14)    $c_0 = \frac{q}{p} \cdot \frac{(p/q) - 1}{(p/q)^a - 1},$

provided that $p \ne q$. If $p = q = \tfrac12$, then (3.13) remains valid, but
(3.14) is to be replaced by $c_0 = 1/a$.
These formulas lead to the final result

(3.15)    $p_{jk}^{(n)} = \frac{(p/q) - 1}{(p/q)^a - 1} \left( \frac{p}{q} \right)^{k-1} + \frac{2^{n+1}}{a}\, p^{1 + (n-j+k)/2}\, q^{(n+j-k)/2} \sum_{r=1}^{a-1} S_r,$

where $S_r$ stands for

$$S_r = \frac{\cos^n \dfrac{\pi r}{a} \left\{ \sin \dfrac{\pi r j}{a} - \left( \dfrac{q}{p} \right)^{1/2} \sin \dfrac{\pi r (j-1)}{a} \right\} \left\{ \sin \dfrac{\pi r k}{a} - \left( \dfrac{q}{p} \right)^{1/2} \sin \dfrac{\pi r (k-1)}{a} \right\}}{1 - 2(pq)^{1/2} \cos \dfrac{\pi r}{a}}.$$

As $n \to \infty$, the second term in (3.15) tends to zero, and we find again
that $p_{jk}^{(n)}$ tends to a stationary distribution independent of $j$. (This
limiting distribution was derived by other methods in problem XV, 9.) Passing to
the limit $a \to \infty$, we get the formula for a random walk with a single
reflecting barrier; in the limit, the sum is replaced by an integral.
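
The final result can be checked numerically. The Python sketch below is an
illustration under the following assumptions: the solutions are taken in the
form (q/p)^(j/2) {sin(πrj/a) - (q/p)^(1/2) sin(πr(j-1)/a)} for the first system
and (p/q)^(k/2) {sin(πrk/a) - (q/p)^(1/2) sin(πr(k-1)/a)} for the second, the
constants are those of (3.13) and (3.14), and the terms are added as in (1.9).
The function names and the numerical values are hypothetical.

```python
import math


def reflecting_entry(a, p, j, k, n):
    """p_jk^(n), n >= 1, for the walk on 1..a with two reflecting barriers."""
    q = 1 - p
    rho = math.sqrt(q / p)
    if p == q:
        value = 1 / a                                   # c_0 = 1/a
    else:                                               # term with s = 1
        value = (p / q - 1) / ((p / q) ** a - 1) * (p / q) ** (k - 1)
    for r in range(1, a):
        phi = math.pi * r / a
        inv_root = 2 * math.sqrt(p * q) * math.cos(phi)         # 1/s_r, (3.7)
        x = rho ** j * (math.sin(phi * j) - rho * math.sin(phi * (j - 1)))
        y = rho ** -k * (math.sin(phi * k) - rho * math.sin(phi * (k - 1)))
        c = (2 * p / a) / (1 - inv_root)                        # (3.13)
        value += c * x * y * inv_root ** n
    return value


def direct_power(a, p, n):
    """n-th power of the transition matrix by repeated multiplication."""
    q = 1 - p
    P = [[0.0] * a for _ in range(a)]
    P[0][0], P[0][1] = q, p
    P[a - 1][a - 2], P[a - 1][a - 1] = q, p
    for i in range(1, a - 1):
        P[i][i - 1], P[i][i + 1] = q, p
    R = [[float(i == j) for j in range(a)] for i in range(a)]
    for _ in range(n):
        R = [[sum(R[i][m] * P[m][j] for m in range(a)) for j in range(a)]
             for i in range(a)]
    return R


for p in (0.3, 0.5):                        # illustrative values only
    for n in range(1, 7):
        R = direct_power(5, p, n)
        for j in range(1, 6):
            for k in range(1, 6):
                assert abs(reflecting_entry(5, p, j, k, n) - R[j - 1][k - 1]) < 1e-9
```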


5 For analogous formulas in the case of one reflecting and one absorbing barrier
see M. Kac, Random walk and the theory of Brownian motion, American
Mathematical Monthly, vol. 54 (1947), pp. 369-391. The definition of the
reflecting barrier is there modified so that the particle may reach 0; whenever
this occurs, the next step takes it to 1. The explicit formulas are then more
complicated. Kac also found formulas for $p_{jk}^{(n)}$ in the Ehrenfest model
[example XV(2.f)].
4. TRANSIENT STATES; ABSORPTION PROBABILITIES


The theorem of section 1 was derived under the assumption that the roots
$s_1, s_2, \ldots$ are distinct. The presence of multiple roots does not require
essential modifications, but we shall discuss only a particular case of special
importance. The root $s_1 = 1$ is multiple whenever the chain contains two or
more closed subchains, and this is a frequent situation in problems connected
with absorption probabilities. It is easy to adapt the method of section 1 to
this case. For conciseness and clarity, we shall explain the procedure by means
of examples which will reveal the main features of the general case.


Examples. (a) Consider the matrix of transition probabilities

(4.1)    $P = \begin{pmatrix}
\tfrac13 & \tfrac23 & 0 & 0 & 0 & 0 \\
\tfrac23 & \tfrac13 & 0 & 0 & 0 & 0 \\
0 & 0 & \tfrac14 & \tfrac34 & 0 & 0 \\
0 & 0 & \tfrac15 & \tfrac45 & 0 & 0 \\
\tfrac14 & 0 & \tfrac14 & 0 & \tfrac14 & \tfrac14 \\
\tfrac16 & \tfrac16 & \tfrac16 & \tfrac16 & \tfrac16 & \tfrac16
\end{pmatrix}.$

It is clear that $E_1$ and $E_2$ form a closed set (that is, no transition is
possible to any of the remaining four states; compare chapter XV, section 4).
Similarly $E_3$ and $E_4$ form another closed set. Finally, $E_5$ and $E_6$ are
transient states. After finitely many steps the system passes into one of the
two closed sets and remains there.

The matrix $P$ has the form of a partitioned matrix

(4.2)    $P = \begin{pmatrix} A & 0 & 0 \\ 0 & B & 0 \\ U & V & T \end{pmatrix},$

where each letter stands for a two-by-two matrix and each zero for a matrix
with four zeros. For example, $A$ has the rows $(\tfrac13, \tfrac23)$ and
$(\tfrac23, \tfrac13)$; this is the matrix of transition probabilities
corresponding to the chain formed by the two states $E_1$ and $E_2$. This matrix
can be studied by itself, and the powers $A^n$ can be obtained from example
(2.a) with $p = \alpha = \tfrac23$. When the powers $P^2, P^3, \ldots$ are
calculated, it will be found that the first two rows are in no way affected by
the remaining four rows. More precisely, $P^n$ has the form

(4.3)    $P^n = \begin{pmatrix} A^n & 0 & 0 \\ 0 & B^n & 0 \\ U_n & V_n & T^n \end{pmatrix},$

where $A^n$, $B^n$, $T^n$ are the $n$th powers of $A$, $B$, and $T$,
respectively, and can be calculated⁶ by the method of section 1 (cf. example
(2.a) where all calculations are performed). Instead of six equations with six
unknowns we are confronted only with systems of two equations with two unknowns
each.

It should be noted that the matrices $U_n$ and $V_n$ in (4.3) are not powers of
$U$ and $V$ and cannot be obtained in the same simple way as $A^n$, $B^n$, and
$T^n$. However, in the calculation of $P^2, P^3, \ldots$ the third and fourth
columns never affect the remaining four columns. In other words, if in $P^n$
the rows and columns corresponding to $E_3$ and $E_4$ are deleted, we get the
matrix

(4.4)    $\begin{pmatrix} A^n & 0 \\ U_n & T^n \end{pmatrix},$

which is the $n$th power of the corresponding submatrix in $P$, that is, of

(4.5)    $\begin{pmatrix} A & 0 \\ U & T \end{pmatrix} = \begin{pmatrix} \tfrac13 & \tfrac23 & 0 & 0 \\ \tfrac23 & \tfrac13 & 0 & 0 \\ \tfrac14 & 0 & \tfrac14 & \tfrac14 \\ \tfrac16 & \tfrac16 & \tfrac16 & \tfrac16 \end{pmatrix}.$

Therefore matrix (4.4) can be calculated by the method of section 1, which in
the present case simplifies considerably. The matrix $V_n$ can be obtained in a
similar way.

6 In $T$ the rows do not add to unity so that $T$ is not a stochastic matrix.
However, the method of section 1 applies without change, except that $s = 1$ is
no longer a root (so that $T^n \to 0$).

Usually the explicit forms of $U_n$ and $V_n$ are of interest only inasmuch as
they are connected with absorption probabilities. If the system starts from,
say, $E_5$, what is the probability $\lambda$ that it will eventually pass into
the closed set formed by $E_1$ and $E_2$ (and not into the other closed set)?
What is the probability $\lambda_n$ that this will occur exactly at the $n$th
step? Clearly $p_{51}^{(n)} + p_{52}^{(n)}$ is the probability that the
considered event occurs at the $n$th step or before, that is,

$$p_{51}^{(n)} + p_{52}^{(n)} = \lambda_1 + \lambda_2 + \cdots + \lambda_n.$$

Letting $n \to \infty$, we get $\lambda$. A preferable way to calculate
$\lambda_n$ is as follows. The $(n-1)$st step must take the system to a state
other than $E_1$ and $E_2$, that is, to either $E_5$ or $E_6$ (since from $E_3$
or $E_4$ no transition to $E_1$ and $E_2$ is possible). The $n$th step then
takes the system to $E_1$ or $E_2$. Hence

(4.6)    $\lambda_n = p_{55}^{(n-1)}(p_{51} + p_{52}) + p_{56}^{(n-1)}(p_{61} + p_{62}) = \tfrac14 p_{55}^{(n-1)} + \tfrac13 p_{56}^{(n-1)}.$

It will be noted that $\lambda_n$ is completely determined by the elements of
$T^{n-1}$, and this matrix is easily calculated. In the present case
$p_{55}^{(n)} = p_{56}^{(n)} = \tfrac14 (\tfrac{5}{12})^{n-1}$ and hence
$\lambda_n = \tfrac{7}{48} (\tfrac{5}{12})^{n-2}$ for $n \ge 2$.
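
The absorption probabilities of this example can be checked with exact
arithmetic. The sketch below is an illustration that assumes the entries of the
submatrix (4.5) as written in the code (rows and columns for E1, E2, E5, E6) and
the closed form 7/48 (5/12)^(n-2) for n >= 2, with 1/4 for n = 1; the function
names are hypothetical.

```python
from fractions import Fraction as F

# Submatrix (4.5), as assumed here: rows and columns for E1, E2, E5, E6.
M = [[F(1, 3), F(2, 3), F(0), F(0)],
     [F(2, 3), F(1, 3), F(0), F(0)],
     [F(1, 4), F(0), F(1, 4), F(1, 4)],
     [F(1, 6), F(1, 6), F(1, 6), F(1, 6)]]


def mat_pow(A, n):
    """n-th power of A by repeated multiplication."""
    size = len(A)
    R = [[F(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        R = [[sum(R[i][m] * A[m][j] for m in range(size)) for j in range(size)]
             for i in range(size)]
    return R


def absorbed_by(n):
    """p_51^(n) + p_52^(n): absorption in {E1, E2} at step n or before."""
    R = mat_pow(M, n)
    return R[2][0] + R[2][1]


def lam(n):
    """Probability that absorption in {E1, E2} occurs exactly at step n."""
    return F(1, 4) if n == 1 else F(7, 48) * F(5, 12) ** (n - 2)


for n in range(1, 9):
    assert absorbed_by(n) == sum(lam(m) for m in range(1, n + 1))
```

Summing the geometric series gives 1/4 + (7/48)/(1 - 5/12) = 1/2 for the
probability of eventual absorption in the first closed set when the system
starts from E5.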


(b) Brother-sister mating. As a second example we give a complete 
treatment of example XV(2./). A glance at the matrix shows that the 
states $E_1$ and $E_5$ form a closed set each (a fact which is clear from the
biological meaning). If the system starts from any other state $E_j$, it
will eventually pass either into $E_1$ or into $E_5$ and then remain there.
The breeder desires to know the corresponding probabilities and the 
expected duration of the process. 

Deleting the first and fifth column and row, we get the reduced matrix

(4.7)    $T = \begin{pmatrix} \tfrac12 & \tfrac14 & 0 & 0 \\ \tfrac14 & \tfrac14 & \tfrac14 & \tfrac18 \\ 0 & \tfrac14 & \tfrac12 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}.$

The powers $T^n$ will now be calculated by the method of section 1. They
represent the transition probabilities among transient states. The equations
(1.7a) reduce to

(4.8)    $x_1 = \frac{s(2x_1 + x_2)}{4}, \quad x_2 = \frac{s(2x_1 + 2x_2 + 2x_3 + x_4)}{8}, \quad x_3 = \frac{s(x_2 + 2x_3)}{4}, \quad x_4 = s x_2.$

This has a solution only if the determinant vanishes, and this condition leads
to a fourth-degree equation in $s$. To simplify writing we put

(4.9)    $\theta_1 = \sqrt{5} - 1, \qquad \theta_2 = \sqrt{5} + 1.$

Then the four roots $s_r$ are

(4.10)    $s_1 = 2, \quad s_2 = 4, \quad s_3 = \theta_1, \quad s_4 = -\theta_2,$

and the corresponding solutions $(x_1^{(r)}, \ldots, x_4^{(r)})$ of (4.8) are

(4.11)    $(1, 0, -1, 0), \quad (1, -1, 1, -4), \quad (1, \theta_1, 1, \theta_1^2), \quad (1, -\theta_2, 1, \theta_2^2).$

The system of linear equations for $y_k^{(r)}$ is obtained by specialization
from (1.7b), and the four sets of solutions are in proper order

(4.12)    $(1, 0, -1, 0), \quad (1, -1, 1, -\tfrac12), \quad (1, \theta_1, 1, \theta_1^2/8), \quad (1, -\theta_2, 1, \theta_2^2/8).$

From (1.11) we find the four constants $c_1 = \tfrac12$, $c_2 = \tfrac15$,
$c_3 = \theta_2^2/40$, $c_4 = \theta_1^2/40$. From (1.8) we get the
$\rho_{jk}^{(r)}$, and finally (1.3) gives us $p_{jk}^{(n)}$ for all transient
states, that is, for $j, k = 2, 3, 4, 6$. For fixed $j, k$ the sequence
$p_{jk}^{(n)}$ is the sum of four geometric series with ratios
$s_1^{-1}, \ldots, s_4^{-1}$.

An absorption in $E_1$ exactly at the $n$th step is possible only if the
$(n-1)$st step takes the system into either $E_2$ or $E_3$, and the $n$th step
into $E_1$. The probability for this is $p_{j2}^{(n-1)}/4 + p_{j3}^{(n-1)}/16$.
Similarly, the probability of absorption at $E_5$ is
$p_{j3}^{(n-1)}/16 + p_{j4}^{(n-1)}/4$. Summing over all $n$ we get the
probabilities that the system will eventually pass into and stay in $E_1$ and
$E_5$, respectively. The actual calculation of these probabilities requires
only the summation of four geometric series.
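
The summation of the four geometric series can be carried out numerically. The
Python sketch below is an illustration resting on several assumptions stated in
the comments: the reduced matrix, the roots 2, 4, √5 - 1, -(√5 + 1), and the two
sets of solutions are taken as written in the code, and the expansion (1.9) is
used also for n = 0, which is legitimate here because the four terms then add up
to the unit matrix (this is checked in the code). The function names are
hypothetical.

```python
import math

theta1, theta2 = math.sqrt(5) - 1, math.sqrt(5) + 1
roots = [2.0, 4.0, theta1, -theta2]
xs = [[1, 0, -1, 0], [1, -1, 1, -4],
      [1, theta1, 1, theta1 ** 2], [1, -theta2, 1, theta2 ** 2]]
ys = [[1, 0, -1, 0], [1, -1, 1, -0.5],
      [1, theta1, 1, theta1 ** 2 / 8], [1, -theta2, 1, theta2 ** 2 / 8]]
cs = [1 / sum(x * y for x, y in zip(xr, yr)) for xr, yr in zip(xs, ys)]  # (1.11)

# Reduced matrix as assumed here; rows and columns refer to E2, E3, E4, E6.
T = [[0.5, 0.25, 0, 0], [0.25, 0.25, 0.25, 0.125],
     [0, 0.25, 0.5, 0], [0, 1, 0, 0]]


def t_power(n):
    """T^n from (1.9)."""
    return [[sum(c * x[j] * y[k] / s ** n
                 for c, s, x, y in zip(cs, roots, xs, ys))
             for k in range(4)] for j in range(4)]


def visits(j, k):
    """Sum over m >= 0 of p_jk^(m): four geometric series with ratios 1/s_r."""
    return sum(c * x[j] * y[k] * s / (s - 1)
               for c, s, x, y in zip(cs, roots, xs, ys))


def absorption_in_E1():
    """Probability of eventual absorption in E1 from E2, E3, E4, E6."""
    return [visits(j, 0) / 4 + visits(j, 1) / 16 for j in range(4)]


# The expansion reproduces the unit matrix for n = 0 and T itself for n = 1.
for j in range(4):
    for k in range(4):
        assert abs(t_power(0)[j][k] - float(j == k)) < 1e-9
        assert abs(t_power(1)[j][k] - T[j][k]) < 1e-9
```

Under these assumptions `absorption_in_E1()` gives values close to 3/4, 1/2,
1/4, 1/2 for the starting states E2, E3, E4, E6.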


5. APPLICATION TO RECURRENCE TIMES 


In problem XIII, 19 it is shown how the mean $\mu$ and the variance
$\sigma^2$ of the recurrence time of a recurrent event $\mathcal{E}$ can be calculated in
terms of the probabilities $u_n$ that $\mathcal{E}$ occurs at the $n$th trial. If $\mathcal{E}$ is
not periodic, then

(5.1)   $u_n \to \frac{1}{\mu} \quad \text{and} \quad \sum_{n=0}^{\infty} \left( u_n - \frac{1}{\mu} \right) = \frac{\sigma^2 - \mu + \mu^2}{2\mu^2},$

provided that $\sigma^2$ is finite.
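The relation (5.1) can be tested numerically on a small example. The sketch below is only an illustration: it assumes a two-state chain with hypothetical transition probabilities `a` ($E_0 \to E_1$) and `b` ($E_1 \to E_0$), takes the recurrent event to be a return to $E_0$, and truncates all series at `N` terms. The first-return probabilities are recovered from the renewal relation $u_n = f_1 u_{n-1} + \cdots + f_n u_0$.

```python
# Two-state chain: P(E0 -> E1) = a, P(E1 -> E0) = b (illustrative values).
a, b = 0.5, 0.25
N = 200  # truncation point; the neglected tails are geometrically small

# u[n] = probability of being in E0 at trial n when starting in E0.
u = [1.0]
p0 = 1.0
for _ in range(N):
    p0 = p0 * (1 - a) + (1 - p0) * b
    u.append(p0)

# First-return probabilities f[n] from u_n = f_1 u_{n-1} + ... + f_n u_0.
f = [0.0] * (N + 1)
for n in range(1, N + 1):
    f[n] = u[n] - sum(f[k] * u[n - k] for k in range(1, n))

# Mean and variance of the recurrence time.
mu = sum(n * f[n] for n in range(1, N + 1))
var = sum(n * n * f[n] for n in range(1, N + 1)) - mu ** 2

lhs = sum(u_n - 1 / mu for u_n in u)          # left side of (5.1)
rhs = (var - mu + mu ** 2) / (2 * mu ** 2)    # right side of (5.1)
```

With these values the recurrence time has mean 3 and variance 10, and both sides of the second relation in (5.1) come out close to $8/9$.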

If we identify $\mathcal{E}$ with a persistent state $E_j$, then $u_n = p_{jj}^{(n)}$ (and
$u_0 = 1$). In a finite Markov chain all recurrence times have finite
variance (cf. problem XV, 19), so that (5.1) applies. Suppose that $E_j$
is not periodic and that formula (1.3) applies. Then $s_1 = 1$ and
$|s_r| > 1$ for $r = 2, 3, \ldots$, so that $p_{jj}^{(n)} \to \rho_{jj}^{(1)} = 1/\mu_j$. To the term
$u_n - 1/\mu$ of (5.1) there corresponds

(5.2)   $p_{jj}^{(n)} - \frac{1}{\mu_j} = \sum_{r=2}^{a} \rho_{jj}^{(r)} s_r^{-n-1}.$

This formula is valid for $n \ge 0$; summing the geometric series with
ratio $s_r^{-1}$, we find

(5.3)   $\sum_{n=0}^{\infty} \left( p_{jj}^{(n)} - \frac{1}{\mu_j} \right) = \sum_{r=2}^{a} \frac{\rho_{jj}^{(r)}}{s_r - 1}.$
Introducing this into (5.1), we find that if $E_j$ is a non-periodic persistent
state, then its mean recurrence time is given by $\mu_j = 1/\rho_{jj}^{(1)}$, and the variance
of its recurrence time is

(5.4)   $\sigma_j^2 = \mu_j - \mu_j^2 + 2\mu_j^2 \sum_{r=2}^{a} \frac{\rho_{jj}^{(r)}}{s_r - 1},$

provided, of course, that formula (1.3) is applicable and $s_1 = 1$. The
case of periodic states and the occurrence of double roots require
only obvious modifications.


CHAPTER XVII 


The Simplest Time-Dependent Stochastic Processes¹


1. GENERAL ORIENTATION 


Random walks and Markov chains are stochastic processes² where
changes occur only at fixed times, say, $t = 1, 2, 3, \ldots$. On the other
hand, in chapter VI, sections 5-6, we were concerned with phenomena
such as telephone calls, radioactive disintegrations, and chromosome
breakages, where changes may occur at any time. Obviously a complete
description of such processes leads beyond the domain of discrete
probabilities. To fix ideas, consider the incoming calls at a telephone
exchange (or, rather, an idealized mathematical model of the actual
process). Every instant $t$ corresponds to a trial, and the result of an
experiment may be described in terms of a function $X(t)$ giving the
number of calls up to time $t$. If the first call occurs at time $t_1$, the
second at $t_2$, etc., the function $X(t)$ equals 0 for $0 \le t < t_1$, 1 for
$t_1 \le t < t_2$, 2 for $t_2 \le t < t_3$, etc. Conversely, every non-decreasing
function $X(t)$, assuming only the values $0, 1, 2, \ldots$, represents a possible
development at our telephone exchange. In other words, a complete
description of our conceptual experiment calls for a sample space
whose points are functions $X(t)$ (and not sequences as in the case of
discrete trials). A compound event such as “seven calls within a
minute on a certain day” is obviously the aggregate of those $X(t)$
which satisfy the condition that for some point $t$ of a specified interval
we have $X(t+h) - X(t) \ge 7$, where $h$ represents the span of one
minute.
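For a single sample function the compound event just described is easy to test mechanically. The following sketch is an illustration only: the call times are invented, times are measured in whole seconds, and the function names `calls_up_to` and `max_increment` are hypothetical.

```python
from bisect import bisect_right


def calls_up_to(call_times, t):
    """X(t): number of calls at or before time t (call_times must be sorted)."""
    return bisect_right(call_times, t)


def max_increment(call_times, start, end, h):
    """Largest value of X(t + h) - X(t) for t in the interval [start, end]."""
    # The increment can only jump upward when t + h reaches a call time,
    # so it suffices to inspect `start` and the points t = call - h.
    candidates = [start] + [c - h for c in call_times if start <= c - h <= end]
    return max(
        calls_up_to(call_times, t + h) - calls_up_to(call_times, t)
        for t in candidates
    )


# Hypothetical call times in seconds after the start of observation.
calls = [5, 12, 20, 31, 38, 44, 59, 70, 200]
busiest = max_increment(calls, start=0, end=300, h=60)
seven_calls_within_a_minute = busiest >= 7
```

The sample function belongs to the event exactly when `seven_calls_within_a_minute` is true; the probability of the event is a statement about the whole family of such functions and is not computed here.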

We cannot deal here with such complicated sample spaces and must
defer the study of the more delicate aspects of the theory. Fortunately,
certain interesting questions can be answered even with the simple
means now at our disposal.

¹ This chapter is almost independent of chapters X-XVI.
² See footnote 11 of chapter XV.

If we limit the consideration to the number of calls $X(t)$ within an
arbitrary but fixed period of duration $t$, then $X(t)$ is a random variable
of the familiar type, assuming the values $0, 1, 2, \ldots$. Let $P_n(t)$ be
the probability that $X(t) = n$. It is true that the distribution $\{P_n(t)\}$
depends on a continuous parameter, but so do most distributions
introduced in this book.

The situation is best illustrated by the Poisson distribution

(1.1)   $P_n(t) = e^{-\lambda t} \frac{(\lambda t)^n}{n!}.$

It was derived in chapter VI, section 5, as a limiting form of the binomial
distribution; a more satisfactory derivation is contained in chapter
XII, section 3. We shall not use the results of that chapter, but
the situation analyzed there is so simple and so typical that a short
summary may serve as the best introduction to the present chapter.

Consider a stochastic process represented by an integral-valued random
variable $X(t) \ge 0$. Intuitively we may interpret $X(t)$, say, as the
cumulative damage by lightning measured to the nearest dollar. We
arrive at a particularly simple mathematical model if we introduce two
postulates as follows. The increment $X(t+s) - X(0)$ during the
time interval from 0 to $t + s$ is the sum of the increments $X(s) - X(0)$
and $X(t+s) - X(s)$ corresponding to the subintervals from 0 to $s$ and
from $s$ to $t + s$. We postulate, first, that these increments $X(s) - X(0)$
and $X(t+s) - X(s)$ are stochastically independent and, secondly, that
the distribution of $X(t+s) - X(s)$ depends only on $t$ (i.e., only on the
length of the interval, not on its position: this is the property of
homogeneity in time).

Let $h_n(t)$ be the probability that $X(t+s) - X(s)$ assumes the
value $n$ (where $n = 0, 1, 2, \ldots$). Analytically, the independence of
$X(t+s) - X(s)$ and $X(s) - X(0)$ is expressed by

(1.2)   $h_n(t+s) = \sum_{j=0}^{n} h_j(s)\, h_{n-j}(t).$


It has been shown in chapter XII, section 3, that the only distribution
$\{h_n(t)\}$ with the property (1.2) is the compound Poisson distribution;
that is, $X(t)$ has the distribution of a random variable

(1.3)   $S_N$ with $P\{N = n\} = e^{-\lambda t} \frac{(\lambda t)^n}{n!},$

where $S_n = Y_1 + Y_2 + \cdots + Y_n$ is the sum of $n$ mutually independent
variables with the common distribution $\{f_j\}$, $j = 0, 1, 2, \ldots$. In our
example $\{f_j\}$ represents the probability distribution of the damage
from an individual hit by lightning; then (1.3) states that the number
of hits in a time interval of length $t$ obeys the Poisson distribution
(1.1), and that the individual damages are independent random variables.
The variable (1.3) has the same probability distribution as the
change $X(t+s) - X(s)$ during an arbitrary interval of length $t$, and
we see that this total change is the sum of a random number $N$ of
individual changes or jumps. The number $N$ of changes has a Poisson
distribution (1.1), and the individual jump has the probability distribution
$\{f_j\}$. In particular, the Poisson distribution (1.1) itself represents
the special case where all jumps are of unit length (that is,
$f_1 = 1$, $f_0 = f_2 = \cdots = 0$, the variables $Y_k$ assuming only the value 1).
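The compound Poisson distribution (1.3) and the relation (1.2) lend themselves to a direct numerical check. The sketch below is an illustration under stated assumptions: the jump distribution `f`, the rate `lam`, and the interval lengths are invented; the Poisson sum is cut off at `max_jumps` terms, which is harmless for the small means used here; all function names are hypothetical.

```python
import math


def poisson_pmf(n, mean):
    """Poisson probability (1.1) with mean = lambda * t."""
    return math.exp(-mean) * mean ** n / math.factorial(n)


def convolve(p, q, size):
    """First `size` terms of the convolution of the sequences p and q."""
    out = [0.0] * size
    for i, p_i in enumerate(p[:size]):
        for j, q_j in enumerate(q[: size - i]):
            out[i + j] += p_i * q_j
    return out


def compound_poisson(f, lam, t, size, max_jumps=80):
    """h_k(t) = P{S_N = k} for k < size, where N is Poisson with mean lam * t."""
    h = [0.0] * size
    n_fold = [1.0] + [0.0] * (size - 1)      # distribution of S_0 = 0
    for n in range(max_jumps + 1):
        weight = poisson_pmf(n, lam * t)
        h = [h_k + weight * s_k for h_k, s_k in zip(h, n_fold)]
        n_fold = convolve(n_fold, f, size)   # distribution of S_{n+1}
    return h


f = [0.0, 0.5, 0.3, 0.2]      # hypothetical jump distribution {f_j}
lam, s, t, size = 1.5, 0.4, 0.9, 12

left = compound_poisson(f, lam, t + s, size)                     # h_n(t + s)
right = convolve(compound_poisson(f, lam, s, size),
                 compound_poisson(f, lam, t, size), size)        # right side of (1.2)
gap = max(abs(x - y) for x, y in zip(left, right))

unit_jumps = compound_poisson([0.0, 1.0], lam, t, size)          # special case f_1 = 1
```

Here `gap` should be at rounding level, and `unit_jumps` should reproduce the simple Poisson probabilities (1.1).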

It will be observed that we have found a characterization of the
simple and the compound Poisson distribution by means of intrinsic
probabilistic properties. The Poisson distribution no longer appears
as an approximation or a limiting form of other distributions but stands
in its own right (or, we might say, as the expression of a physical law).
Its derivation is of a purely analytic character, the notion of a stochastic
process and the random variable $X(t)$ serving only to get a set of plausible
postulates on the distribution $\{h_n(t)\}$. For many applications,
nothing beyond the knowledge of $\{h_n(t)\}$ is required. Theoretically,
it should be shown that $\{h_n(t)\}$ really determines a family of random
variables $X(t)$ and all relevant probability relations such as the probability
of the event that $X(t)$ will ever exceed $at + b$ (this is the ruin
problem of the collective risk theory in insurance).

Questions of this type lead beyond the scope of this book. We shall
be content to translate a physical description of a process into properties
required of the basic probabilities $P_n(t)$ and to consider $\{P_n(t)\}$ as
a family of discrete probability distributions depending on $t$.

This artificial limitation to discrete probabilities has unavoidable
drawbacks. Consider, for example, the zero term in (1.1). We interpret

(1.4)   $P_0(t) = e^{-\lambda t}$

as the probability that no call occurs within an observation period of
duration $t$. This formulation suggests that $P_0(t)$ might be interpreted
as the probability that the waiting time (starting at an arbitrary moment)
up to the first call exceeds $t$. It can be shown that this interpretation
is correct, but it will be noticed that it involves probabilities in
a continuum. The operational meaning of our first formulation is as
follows: Make a series of “identical observations” with a fixed observational
period $t$. Each trial results in either “no call” (success) or “one
or more calls” (failure). Then we have Bernoulli trials with the probability
of success $e^{-\lambda t}$. With the second interpretation we are to wait
until a call arrives. Every positive number is a possible waiting time,
so that the sample space corresponding to each trial is the half-line
$t > 0$. Formula (1.4) then represents a continuous probability distribution.


2. THE POISSON PROCESS 


We begin by giving a new derivation of the Poisson distribution; it 
is by no means better than the derivation described above, but it lends 
itself more naturally to various generalizations which we propose to 
study. 

Take a system subject to instantaneous changes due to the occur- 
rence of random events such as splitting of physical particles, arrival 
of telephone calls, or breakage of a chromosome under harmful irradia- 
tion. All changes are assumed to be of the same kind, and we are con- 
cerned only with their total number. Each change is represented by a 
point on the time axis, so that we are studying certain random distri- 
butions of points on a line. 

The physical processes which we have in mind are characterized by
the two properties, that they are homogeneous in time and that future
changes are independent of past changes. By this we mean that the
forces and influences which determine the process remain absolutely
unchanged, so that the probability of any particular event is the same
for all time intervals of length $t$, independent of where this interval is
situated and of the past history of the system.³

We now translate this description into mathematical language. The
process is to be described in terms of probabilities⁴ $P_n(t)$ that exactly $n$
changes occur during a time interval of length $t$. In particular, $P_0(t)$
is the probability of no change, and $1 - P_0(t)$ the probability of one
or more changes. We shall assume that⁵ as $t \to 0$

(2.1)   $\frac{1 - P_0(t)}{t} \to \lambda,$

where $\lambda$ is a positive constant. Then for a small interval of length $h$
the probability of one or more changes is $1 - P_0(h) = \lambda h + o(h)$,
where the term $o(h)$ denotes a quantity which is of smaller order of
magnitude than $h$. We now formulate our

Postulates for the Poisson Process. Whatever the number of
changes during $(0, t)$, the probability that during $(t, t+h)$ a change occurs
is $\lambda h + o(h)$, and the probability that more than one change occurs is $o(h)$.

³ In a telephone exchange incoming calls are more frequent during the busiest
hour of the day than, say, between midnight and 1 a.m.; the process is therefore
not homogeneous in time. However, for obvious reasons telephone engineers are
concerned mainly with the “busy hour” of the day, and for that period the process
can be considered homogeneous. Experience shows also that during the busy hour
the incoming traffic follows the Poisson distribution with surprising accuracy.
Similar considerations apply to automobile accidents, which are more frequent on
Sundays, etc.

⁴ For a non-homogeneous process we should have to introduce the probability
$P_n(t_1, t_2)$ that $n$ changes occur in the interval $t_1 < t < t_2$.

⁵ This condition can be dispensed with; see section 6.


These conditions easily lead to a system of differential equations for
$P_n(t)$. Consider two contiguous intervals $(0, t)$ and $(t, t+h)$, where $h$
is small. If $n \ge 1$, then exactly $n$ changes can occur in the interval
$(0, t+h)$ in three mutually exclusive ways: (1) no change during
$(t, t+h)$ and $n$ changes during $(0, t)$; (2) one change during $(t, t+h)$
and $n - 1$ changes during $(0, t)$; (3) $x \ge 2$ changes during $(t, t+h)$
and $n - x$ changes during $(0, t)$. According to our hypotheses, the
probability of the first contingency is $P_n(t)$ times the probability of
no change during $(t, t+h)$, and this last is $1 - \lambda h - o(h)$. Similarly,
the second contingency has probability $P_{n-1}(t)\lambda h + o(h)$, and the last
has a probability of smaller order of magnitude than $h$. This means
that

(2.2)   $P_n(t+h) = P_n(t)(1 - \lambda h) + P_{n-1}(t)\lambda h + o(h)$

or

(2.3)   $\frac{P_n(t+h) - P_n(t)}{h} = -\lambda P_n(t) + \lambda P_{n-1}(t) + \frac{o(h)}{h}.$

As $h \to 0$, the last term tends to zero; hence the limit⁶ of the left side
exists and

(2.4)   $P'_n(t) = -\lambda P_n(t) + \lambda P_{n-1}(t) \qquad (n \ge 1).$
For $n = 0$ the second and third contingencies mentioned above do not
arise, and therefore (2.4) is to be replaced by the simpler equation

(2.5)   $P_0(t+h) = P_0(t)(1 - \lambda h) + o(h),$

which leads to

(2.6)   $P'_0(t) = -\lambda P_0(t).$

From (2.6) and $P_0(0) = 1$ we get $P_0(t) = e^{-\lambda t}$. Substituting this
$P_0(t)$ into (2.4) with $n = 1$, we get an ordinary differential equation
for $P_1(t)$. Since $P_1(0) = 0$, we find easily that $P_1(t) = \lambda t e^{-\lambda t}$, in
agreement with the Poisson distribution (1.1). Proceeding in the same
way, we find successively all terms of (1.1).

⁶ Since we restricted $h$ to positive values, $P'_n(t)$ in (2.4) should be interpreted as
a right-hand derivative. It is really an ordinary two-sided derivative. In fact,
the term $o(h)$ in (2.2) does not depend on $t$ and therefore remains unchanged when
$t$ is replaced by $t - h$. Thus (2.2) implies continuity, and (2.3) implies differentiability
in the ordinary sense. This remark applies throughout the chapter and
will not be repeated.
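The successive solution of (2.4) and (2.6) can also be imitated numerically. The sketch below is an illustration, not the derivation used in the text: it integrates the first few equations with the classical fourth-order Runge-Kutta method (a technique swapped in here for convenience) and compares the result with (1.1). The rate `lam`, the time `t_end`, the number of equations, and the step count are arbitrary choices.

```python
import math


def poisson_rhs(p, lam):
    """Right-hand sides of (2.6) for n = 0 and of (2.4) for n >= 1."""
    return [-lam * p[0]] + [
        -lam * p[n] + lam * p[n - 1] for n in range(1, len(p))
    ]


def integrate(lam, t_end, n_max, steps):
    """Runge-Kutta integration of P_0, ..., P_{n_max} from t = 0 to t_end."""
    p = [1.0] + [0.0] * n_max          # P_0(0) = 1, P_n(0) = 0 for n >= 1
    h = t_end / steps
    for _ in range(steps):
        k1 = poisson_rhs(p, lam)
        k2 = poisson_rhs([x + 0.5 * h * k for x, k in zip(p, k1)], lam)
        k3 = poisson_rhs([x + 0.5 * h * k for x, k in zip(p, k2)], lam)
        k4 = poisson_rhs([x + h * k for x, k in zip(p, k3)], lam)
        p = [
            x + h * (a + 2 * b + 2 * c + d) / 6
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)
        ]
    return p


lam, t_end = 2.0, 1.5
numeric = integrate(lam, t_end, n_max=10, steps=2000)
exact = [
    math.exp(-lam * t_end) * (lam * t_end) ** n / math.factorial(n)
    for n in range(11)
]
max_error = max(abs(a - b) for a, b in zip(numeric, exact))
```

Because the equation for $P_n$ involves only $P_n$ and $P_{n-1}$, cutting the system off at `n_max` introduces no error in the retained terms.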


3. THE PURE BIRTH PROCESS 


In the Poisson process the probability of a change during $(t, t+h)$
is independent of the number of changes during $(0, t)$. The simplest
generalization consists of dropping this assumption. Assume instead
that, when $n$ changes occur during $(0, t)$, the probability of a new
change during $(t, t+h)$ equals $\lambda_n h$ plus terms of smaller order of magnitude
than $h$; the single constant $\lambda$ characterizing the process is replaced
by the sequence $\lambda_0, \lambda_1, \lambda_2, \ldots$.

It is convenient to introduce a more flexible terminology. Instead
of saying that $n$ changes occur during $(0, t)$, we shall say that the system
is in state $E_n$. A new change then becomes a transition $E_n \to E_{n+1}$.
In a pure birth process transitions from $E_n$ are possible only to $E_{n+1}$.
Such a process is characterized by the following


Postulates. If at time $t$ the system is in state $E_n$ $(n = 0, 1, 2, \ldots)$,
then the probability that during $(t, t+h)$ a transition to $E_{n+1}$ occurs equals
$\lambda_n h + o(h)$; the probability of any other change is $o(h)$.


The salient feature of this assumption is that the time which the 
system spends in any particular state plays no role; there are sudden 
changes of state but no aging as long as the system remains within a 
single state. 

Again let $P_n(t)$ be the probability that at time $t$ the system is in
state $E_n$. The functions $P_n(t)$ satisfy a system of differential equations
which can be derived by the argument of the preceding section, with
the only change that (2.2) is replaced by

(3.1)   $P_n(t+h) = P_n(t)(1 - \lambda_n h) + P_{n-1}(t)\lambda_{n-1} h + o(h).$

In this way we get the basic system of differential equations

(3.2)   $P'_n(t) = -\lambda_n P_n(t) + \lambda_{n-1} P_{n-1}(t) \qquad (n \ge 1),$
        $P'_0(t) = -\lambda_0 P_0(t).$
We can calculate $P_0(t)$ first and then, by recursion, all $P_n(t)$. If the
state of the system represents the number of changes during $(0, t)$,
then the initial state is $E_0$ so that $P_0(0) = 1$ and hence $P_0(t) = e^{-\lambda_0 t}$.
However, the system need not start from state $E_0$ [see example (3.b)].
If at time zero the system is in $E_i$, then we have

(3.3)   $P_i(0) = 1, \qquad P_n(0) = 0 \quad \text{for } n \ne i.$

These initial conditions uniquely determine the solution $\{P_n(t)\}$ of
(3.2). (In particular, $P_0(t) = P_1(t) = \cdots = P_{i-1}(t) = 0$.) Explicit
formulas for $P_n(t)$ have been derived independently by many authors
but are of no interest to us. It is easily verified that for arbitrarily
prescribed $\lambda_n$ the system $\{P_n(t)\}$ has all required properties, except
that under certain conditions $\sum P_n(t) < 1$. This phenomenon will be
discussed in section 4.


Examples. (a) Radioactive transmutations. A radioactive atom,
say uranium, may by emission of particles or $\gamma$-rays change to an atom
of a different kind. Each kind represents a possible state of the system,
and as the process continues, we get a succession of transitions
$E_0 \to E_1 \to E_2 \to \cdots \to E_m$. According to accepted physical theories,
the probability of a transition $E_n \to E_{n+1}$ remains unchanged as long
as the atom is in state $E_n$, and this hypothesis is expressed by our
starting supposition. The differential equations (3.2) therefore describe
the process (a fact well known to physicists). If $E_m$ is the terminal
state from which no further transitions are possible, then $\lambda_m = 0$ and
the system (3.2) terminates with $n = m$. (For $n > m$ we get automatically
$P_n(t) = 0$.)
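The recursive character of (3.2) can be made concrete for a short chain of transmutations. The sketch below is an illustration under stated assumptions: the chain has three states with invented rates $\lambda_0 = 1$, $\lambda_1 = 0.5$, $\lambda_2 = 0$; each $P_n$ is advanced from $P_{n-1}$ by the variation-of-constants formula with a trapezoidal rule for the inflow (a numerical device chosen here, not taken from the text); the closed form used for comparison is the elementary solution of the first two equations of (3.2) with these rates.

```python
import math


def pure_birth(rates, t_end, steps, start=0):
    """Approximate P_0(t_end), ..., P_m(t_end) for birth rates lambda_0..lambda_m."""
    h = t_end / steps
    m = len(rates)
    p = [1.0 if n == start else 0.0 for n in range(m)]
    decay = [math.exp(-lam * h) for lam in rates]
    for _ in range(steps):
        new = [0.0] * m
        for n in range(m):
            new[n] = decay[n] * p[n]
            if n > 0:
                # trapezoidal rule for the inflow from state n-1 during one step
                new[n] += 0.5 * h * rates[n - 1] * (decay[n] * p[n - 1] + new[n - 1])
        p = new
    return p


lam0, lam1 = 1.0, 0.5          # hypothetical transition rates; E_2 is terminal
t_end = 2.0
p0, p1, p2 = pure_birth([lam0, lam1, 0.0], t_end, steps=4000)

exact_p0 = math.exp(-lam0 * t_end)
exact_p1 = lam0 / (lam1 - lam0) * (math.exp(-lam0 * t_end) - math.exp(-lam1 * t_end))
```

The inner loop runs over the states in increasing order, so that `new[n - 1]` is already available when `new[n]` is computed; this mirrors the remark that $P_0(t)$ is found first and all other $P_n(t)$ follow by recursion.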

(b) The Yule process. Consider a population of members which can
(by splitting or otherwise) give birth to new members but cannot die.
Assume that during any short time interval of length $h$ each member has
probability $\lambda h + o(h)$ to create a new one; the constant $\lambda$ determines
the rate of increase of the population. If there is no interaction among
the members and at time $t$ the population size is $n$, then the probability
of an increase during $(t, t+h)$ is $n\lambda h + o(h)$. The probability $P_n(t)$
that the population numbers exactly $n$ elements therefore satisfies (3.2)
with $\lambda_n = n\lambda$, that is,

(3.4)   $P'_n(t) = -n\lambda P_n(t) + (n-1)\lambda P_{n-1}(t) \qquad (n \ge 1).$

If $i$ is the population size at time $t = 0$, then the initial conditions (3.3)
apply. It is easily verified that for $n \ge i$ the solution is given by

(3.5)   $P_n(t) = \binom{n-1}{n-i} e^{-i\lambda t} \left(1 - e^{-\lambda t}\right)^{n-i}$

and, of course, $P_n(t) = 0$ for $n < i$. This distribution is a special case
of the negative binomial distribution: using the definition VI(8.1) we
may rewrite (3.5) as $P_n(t) = f(n-i;\, i,\, e^{-\lambda t})$. It follows [cf. example
IX(3.c)] that the population size at time $t$ is the sum of $i$ independent
random variables each having the distribution obtained from (3.5) on
replacing $i$ by 1. These $i$ variables represent the progenies of the $i$
original members of our population.
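The verification that (3.5) solves (3.4) can be delegated to a short numerical experiment. The sketch below is an illustration only: the parameter values are invented, the derivative in (3.4) is replaced by a central difference, and the infinite sums are truncated where the terms are negligible. The expected population size $i e^{\lambda t}$ used for comparison is the mean of the negative binomial distribution in question.

```python
import math


def yule_pmf(n, t, lam, i):
    """Formula (3.5): probability of population size n at time t, starting from i."""
    if n < i:
        return 0.0
    q = math.exp(-lam * t)
    return math.comb(n - 1, n - i) * q ** i * (1 - q) ** (n - i)


lam, i, t = 0.7, 3, 1.2        # illustrative parameter values


def residual(n, eps=1e-5):
    """Left side minus right side of (3.4), with a numerical derivative."""
    derivative = (yule_pmf(n, t + eps, lam, i) - yule_pmf(n, t - eps, lam, i)) / (2 * eps)
    rhs = -n * lam * yule_pmf(n, t, lam, i) + (n - 1) * lam * yule_pmf(n - 1, t, lam, i)
    return derivative - rhs


worst = max(abs(residual(n)) for n in range(i, 40))
total = sum(yule_pmf(n, t, lam, i) for n in range(i, 400))
mean = sum(n * yule_pmf(n, t, lam, i) for n in range(i, 400))
expected_mean = i * math.exp(lam * t)
```

A small value of `worst` indicates that (3.5) satisfies (3.4) at the chosen time, `total` should be close to 1, and `mean` should agree with `expected_mean`.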

This type of process was first studied by Yule⁷ in connection with
the mathematical theory of evolution. The population consists of the
species within a genus, and the creation of a new element is due to
mutations. The assumption that each species has the same probability
of throwing out a new species neglects the difference in species sizes.
Since we have also neglected the possibility that a species may die out,
formula (3.5) can be expected to give only a crude approximation.
Furry⁸ used the same model to describe a process connected with cosmic
rays, but again the approximation is rather crude. The differential
equations (3.4) apply strictly to a population of particles which can
split into exact replicas of themselves, provided, of course, that there
is no interaction among particles.


*4. DIVERGENT BIRTH PROCESSES


The solution $\{P_n(t)\}$ of the infinite system of differential equations
(3.2) subject to initial conditions (3.3) can be calculated inductively,
starting from $P_i(t) = e^{-\lambda_i t}$. The distribution $\{P_n(t)\}$ is therefore
uniquely determined. From the familiar formulas for solving linear
differential equations it follows also that $P_n(t) \ge 0$. The only question
left open is whether $\{P_n(t)\}$ is an honest probability distribution, that
is, whether or not

(4.1)   $\sum P_n(t) = 1$

for all $t$. We shall see that this is not always so: if the coefficients $\lambda_n$
increase sufficiently fast, then it may happen that

(4.2)   $\sum P_n(t) < 1.$

At first sight this possibility appears surprising and, perhaps, disturbing,
but it finds a ready explanation. The left side in (4.2) may be
interpreted as the probability that during time $t$ only a finite number
of changes takes place. Accordingly, the difference between the two
sides in (4.2) accounts for the possibility of infinitely many changes,
or a sort of explosion. For a better understanding of this phenomenon
let us compare our probabilistic model of growth with the familiar
deterministic approach.

⁷ G. Udny Yule, A mathematical theory of evolution, based on the conclusions
of Dr. J. C. Willis, F.R.S., Philosophical Transactions of the Royal Society, London,
Series B, vol. 213 (1924), pp. 21-87. Yule does not introduce the differential
equations (3.4) but derives $P_n(t)$ by a limiting process similar to the one used in
chapter VI, section 5, for the Poisson process. Much more general, and more
flexible, models of the same type were devised and applied to epidemics and population
growth in an unpretentious and highly interesting paper by Lieutenant
Colonel A. G. M’Kendrick, Applications of mathematics to medical problems,
Proceedings Edinburgh Mathematical Society, vol. 44 (1925), pp. 1-34. It is very
unfortunate that this remarkable paper passed practically unnoticed. In particular,
it was unknown to the present author when he introduced various stochastic
models for population growth in Die Grundlagen der Volterraschen Theorie des
Kampfes ums Dasein in wahrscheinlichkeitstheoretischer Behandlung, Acta
Biotheoretica, vol. 5 (1939), pp. 11-40.

⁸ On fluctuation phenomena in the passage of high-energy electrons through lead,
Physical Review, vol. 52 (1937), p. 569.

* This section treats a special topic and may be omitted.

The quantity $\lambda_n$ in (3.2) could be called the average rate of growth
at a time when the population size is $n$. For example, in the special
case (3.4) we have $\lambda_n = n\lambda$, so that the average rate of growth is proportional
to the actual population size. If growth is not subject to
chance fluctuations and has a rate of increase proportional to the instantaneous
population size, then $x(t)$ varies in accordance with the
deterministic differential equation

(4.3)   $\frac{dx(t)}{dt} = \lambda x(t).$

It follows that at time $t$ the population size is

(4.4)   $x(t) = i e^{\lambda t},$

where $i = x(0)$ is the initial population size. The connection between
(3.4) and (4.3) is not purely formal. It is readily seen that (4.4) actually
gives the expected value of the distribution (3.5), so that (4.3)
describes the expected population size, whereas (3.4) takes account of
chance fluctuations.

Let us now consider a deterministic growth process where the rate
of growth increases faster than the population size. To a rate of
growth proportional to $x^2(t)$ there corresponds the differential equation

(4.5)   $\frac{dx(t)}{dt} = \lambda x^2(t),$

whose solution is

(4.6)   $x(t) = \frac{i}{1 - \lambda i t}.$

Note that $x(t)$ increases over all bounds as $t \to 1/\lambda i$. In other words,
the assumption that the rate of growth increases as the square of the
population size implies an infinite growth within a finite time interval.
Similarly, if in (3.2) the $\lambda_n$ increase too fast, there is a finite probability
that infinitely many changes take place in a finite time interval. A
precise answer about the conditions when such a divergent growth
occurs is given by the


Theorem. In order that (4.1) may hold for all $t$ it is necessary and
sufficient that the series

(4.7)   $\sum \frac{1}{\lambda_n}$

diverge.


Proof. Letting

(4.8)   $S_k(t) = P_0(t) + \cdots + P_k(t),$

we get from (3.2)

(4.9)   $S'_k(t) = -\lambda_k P_k(t)$

and hence for $k \ge i$

(4.10)   $1 - S_k(t) = \lambda_k \int_0^t P_k(\tau)\, d\tau.$

Since all terms in (4.8) are non-negative, the sequence $S_k(t)$, for
fixed $t$, can only increase with $k$, and therefore the right side in (4.10)
decreases monotonically with $k$. Call its limit $\mu(t)$. Then for $k \ge i$

(4.11)   $\lambda_k \int_0^t P_k(\tau)\, d\tau \ge \mu(t)$

and hence

(4.12)   $\int_0^t S_n(\tau)\, d\tau \ge \mu(t) \left( \frac{1}{\lambda_i} + \frac{1}{\lambda_{i+1}} + \cdots + \frac{1}{\lambda_n} \right).$

Because of (4.10) we have $S_n(t) \le 1$, so that the left side in (4.12) is
at most $t$. If the series (4.7) diverges, the second factor on the right
in (4.12) tends to infinity, and the inequality can hold only if $\mu(t) = 0$
for all $t$. In this case the right side in (4.10) tends to zero as $k \to \infty$,
and therefore $S_k(t) \to 1$, so that (4.1) holds. Conversely,⁹ integrating
(4.8) and using (4.10) we see that the left side of (4.12) is less than
$\lambda_i^{-1} + \lambda_{i+1}^{-1} + \cdots + \lambda_n^{-1}$. If the series (4.7) converges, this expression
is bounded and hence it is impossible that $S_n(t) \to 1$ for all $t$.
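The dichotomy of the theorem shows up clearly in a numerical experiment with the partial sums $S_k(t)$ of (4.8). The sketch below is an illustration under stated assumptions: the process starts in $E_0$; the two rate sequences $\lambda_n = n + 1$ (series (4.7) divergent) and $\lambda_n = (n + 1)^2$ (series convergent) are chosen for the example; the system (3.2) is integrated with the classical Runge-Kutta method, a device not used in the text. For the linear rates the process is a Yule process started with one member, so (3.5) supplies the exact comparison value `linear_exact`.

```python
import math


def partial_sum(rate, k_max, t_end, steps):
    """S_k(t_end) = P_0 + ... + P_k for a pure birth process started in E_0."""
    lam = [rate(n) for n in range(k_max + 1)]
    p = [1.0] + [0.0] * k_max

    def rhs(q):
        # right-hand sides of (3.2) for n = 0, ..., k_max
        return [-lam[0] * q[0]] + [
            -lam[n] * q[n] + lam[n - 1] * q[n - 1] for n in range(1, k_max + 1)
        ]

    h = t_end / steps
    for _ in range(steps):
        k1 = rhs(p)
        k2 = rhs([x + 0.5 * h * d for x, d in zip(p, k1)])
        k3 = rhs([x + 0.5 * h * d for x, d in zip(p, k2)])
        k4 = rhs([x + h * d for x, d in zip(p, k3)])
        p = [
            x + h * (a + 2 * b + 2 * c + d) / 6
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)
        ]
    return sum(p)


t_end = 2.0
linear = partial_sum(lambda n: n + 1.0, 40, t_end, 4000)
quadratic_20 = partial_sum(lambda n: (n + 1.0) ** 2, 20, t_end, 4000)
quadratic_40 = partial_sum(lambda n: (n + 1.0) ** 2, 40, t_end, 4000)

# Exact S_40(t) for the linear rates, from (3.5) with i = 1 and lambda = 1.
linear_exact = 1.0 - (1.0 - math.exp(-t_end)) ** 41
```

For the linear rates the partial sum is already close to 1, in agreement with (4.1). For the quadratic rates the partial sums can only increase with $k$, yet doubling the number of states changes them very little and they remain well below 1; the missing probability is the chance of infinitely many changes before time $t$.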


5. THE BIRTH AND DEATH PROCESS 


The pure birth process of section 3 provides a satisfactory description
of radioactive transmutations, but it cannot serve as a realistic model
for changes in the size of populations whose members can die (or drop
out). This suggests generalizing the model by permitting transitions
from the state $E_n$ not only to the next higher state $E_{n+1}$ but also to
the next lower state $E_{n-1}$. (More general processes will be defined in
section 9.) Accordingly we start from the following


Postulates. The system changes only through transitions from states
to their next neighbors (from $E_n$ to $E_{n+1}$ or $E_{n-1}$ if $n \ge 1$, but from $E_0$
to $E_1$ only). If at any time $t$ the system is in state $E_n$, the probability
that during $(t, t+h)$ the transition $E_n \to E_{n+1}$ occurs equals $\lambda_n h + o(h)$,
and the probability of $E_n \to E_{n-1}$ (if $n \ge 1$) equals $\mu_n h + o(h)$. The
probability that during $(t, t+h)$ more than one change occurs is $o(h)$.


It is easy to adapt the method of section 2 to derive differential
equations for the probabilities $P_n(t)$ of finding the system at time $t$ in
state $E_n$. To calculate $P_n(t+h)$, note that at time $t + h$ the system
can be in state $E_n$ only if one of the following conditions is satisfied:
(1) At time $t$ the system is in $E_n$ and during $(t, t+h)$ no change occurs;
(2) at time $t$ the system is in $E_{n-1}$ and a transition to $E_n$ occurs; (3) at
time $t$ the system is in $E_{n+1}$ and a transition to $E_n$ occurs; (4) during
$(t, t+h)$ two or more transitions occur. By assumption, the probability
of the last event is $o(h)$. The first three contingencies are mutually
exclusive, so that their probabilities add. Therefore

(5.1)   $P_n(t+h) = P_n(t)\{1 - \lambda_n h - \mu_n h\} + \lambda_{n-1} h P_{n-1}(t) + \mu_{n+1} h P_{n+1}(t) + o(h).$

Transposing the term $P_n(t)$ and dividing the equation by $h$, we get on
the left the difference ratio of $P_n(t)$. Letting $h \to 0$, we get

(5.2)   $P'_n(t) = -(\lambda_n + \mu_n) P_n(t) + \lambda_{n-1} P_{n-1}(t) + \mu_{n+1} P_{n+1}(t).$

This equation holds for $n \ge 1$. For $n = 0$ in the same way

(5.3)   $P'_0(t) = -\lambda_0 P_0(t) + \mu_1 P_1(t).$

If at time zero the system is in state $E_i$, the initial conditions are

(5.4)   $P_i(0) = 1, \qquad P_n(0) = 0 \quad \text{for } n \ne i.$

⁹ By a regrettable oversight the following three lines were missing in the first
printing of the first edition and part of the preceding argument was repeated instead.
The error was corrected after a few months. (The present discussion is continued
in section 10.)


The birth and death process is thus seen to depend on the infinite
system of differential equations (5.2)-(5.3) together with the initial
condition (5.4). The question of existence and of uniqueness of solutions
is in this case by no means trivial. In a pure birth process the
system (3.2) of differential equations was also infinite, but it had the
form of recurrence relations; $P_0(t)$ was determined by the first equation
and $P_n(t)$ could be calculated from $P_{n-1}(t)$. The new system
(5.2) is not of this form, and all $P_n(t)$ must be found simultaneously.
We shall here (and elsewhere in this chapter) state properties of the
solutions without proof.¹⁰

For arbitrarily prescribed coefficients $\lambda_n \ge 0$, $\mu_n \ge 0$ there always exists
a positive solution $\{P_n(t)\}$ of (5.2)-(5.4) such that $\sum P_n(t) \le 1$. If the
coefficients are bounded (or increase sufficiently slowly), this solution is
unique and satisfies the regularity condition $\sum P_n(t) = 1$. However, it
is possible to choose the coefficients in such a way that $\sum P_n(t) < 1$
and that there exist infinitely many solutions. In the latter case we
encounter a phenomenon analogous to that studied in the preceding
section for the pure birth process. This situation is of considerable
theoretical interest, but the reader may safely assume that in all
cases of practical significance the conditions of uniqueness are satisfied;
in this case automatically $\sum P_n(t) = 1$ (see section 10).

¹⁰ A simple existence proof and uniqueness criterion (although using Laplace
transforms) applicable to the most general equations of this chapter is contained
in section 4 of W. Feller, On boundary conditions for the Kolmogorov differential
equations, Annals of Mathematics, vol. 65 (1957), pp. 527-570. The first existence
proof was given in The integrodifferential equations of completely discontinuous
Markov processes, Transactions American Mathematical Society, vol. 48 (1940),
pp. 488-515. Unfortunately this paper treats the general case of non-denumerable
sample spaces and time dependent coefficients, and it has generally been overlooked
that the specialization to the case of ordinary differential equations with
constant coefficients treated in this chapter leads to a simple existence proof.

¹¹ Solutions of the birth and death process such that $\sum P_n(t) < 1$ have recently
attracted wide attention. See W. Ledermann and G. E. Reuter, Spectral theory
for the differential equations of simple birth and death processes, Philosophical
Transactions Royal Society, London, Ser. A, vol. 246 (1954), pp. 321-369; S. Karlin
and J. McGregor, Representation of a class of stochastic processes, Proceedings
National Academy Sciences, USA, vol. 41 (1955), pp. 387-391; forthcoming papers
by the same authors and another by B. O. Koopman, both in the Transactions
American Mathematical Society.

When $\lambda_0 = 0$ the transition $E_0 \to E_1$ is impossible. In the terminology
of Markov chains $E_0$ is an absorbing state from which no exit is
possible; once the system is in $E_0$ it stays there. From (5.3) it follows
that in this case $P'_0(t) \ge 0$, so that $P_0(t)$ increases monotonically.
The limit $P_0(\infty)$ is the probability of ultimate absorption.

More generally, it can be shown that the limits

(5.5)   $\lim_{t \to \infty} P_n(t) = p_n$

exist and are independent of the initial conditions (5.4); they satisfy the
system of linear equations obtained from (5.2)-(5.3) on putting
$P'_n(t) = 0$. The relation (5.5) is usually interpreted as a “tendency
toward the steady state condition” and this suggestive name has caused
much confusion. It must be understood that, except when $E_0$ is an
absorbing state, the chance fluctuations continue forever unabated and
(5.5) shows only that in the long run the influence of the initial condition
disappears. The remarks made in chapter XV, section 6, concerning
the statistical equilibria apply here without change.
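For a process with finitely many states the linear equations obtained by putting $P'_n(t) = 0$ can be solved in closed form. The sketch below is an illustration under assumptions that go beyond the text: the state space is cut off at $E_N$ by taking $\lambda_N = 0$, all other $\lambda_n$ and all $\mu_n$ with $n \ge 1$ are positive, and the rates shown are invented. Under these assumptions (5.3) with vanishing left side gives $\lambda_0 p_0 = \mu_1 p_1$, and induction on (5.2) gives $\lambda_{n-1} p_{n-1} = \mu_n p_n$ for every $n$.

```python
def stationary(birth, death):
    """Limits p_0..p_N for rates birth[n] = lambda_n and death[n] = mu_n (death[0] = 0)."""
    weights = [1.0]
    for n in range(1, len(birth)):
        # balance between neighbours: lambda_{n-1} p_{n-1} = mu_n p_n
        weights.append(weights[-1] * birth[n - 1] / death[n])
    total = sum(weights)
    return [w / total for w in weights]


def derivatives(p, birth, death):
    """Right-hand sides of (5.2)-(5.3) at the point p, with no states beyond E_N."""
    last = len(p) - 1
    out = []
    for n in range(len(p)):
        value = -(birth[n] + death[n]) * p[n]
        if n > 0:
            value += birth[n - 1] * p[n - 1]
        if n < last:
            value += death[n + 1] * p[n + 1]
        out.append(value)
    return out


# Hypothetical rates: constant births, deaths proportional to the state, N = 5.
birth = [2.0, 2.0, 2.0, 2.0, 2.0, 0.0]
death = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

limits = stationary(birth, death)
residuals = derivatives(limits, birth, death)
```

All entries of `residuals` should vanish up to rounding, which is the statement that the limits satisfy (5.2)-(5.3) with $P'_n(t) = 0$; the individual sample paths, of course, keep fluctuating.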

The truth of (5.5) can be proved either from explicit formulas for
the $P_n(t)$ or from general ergodic theories. Intuitively the theorem
becomes almost obvious by a comparison of our process with a simple
Markov chain with transition probabilities

(5.6)   $p_{n,n+1} = \frac{\lambda_n}{\lambda_n + \mu_n}, \qquad p_{n,n-1} = \frac{\mu_n}{\lambda_n + \mu_n}.$

In this chain the only direct transitions are $E_n \to E_{n+1}$ and $E_n \to E_{n-1}$,
and they have the same conditional probabilities as in our process; the
difference between the chain and our process lies in the fact that, with
the latter, changes can occur at arbitrary times, so that the number of
transitions during time $t$ is a random variable. However, for large $t$
this number is certain to be large, and hence it is plausible that for
$t \to \infty$ the probabilities $P_n(t)$ behave as the corresponding probabilities
of the simple chain.

The principal field of applications of the birth and death process is 
to problems of waiting times, trunking, etc.; see sections 6 and 7. 


Examples. (a) Linear growth. Suppose that a population consists 
of elements which can split or die. During any short time interval of 
length $h$ the probability for any living element to split into two is 
$\lambda h + o(h)$, whereas the corresponding probability of dying is $\mu h + o(h)$. 
Here $\lambda$ and $\mu$ are two constants characteristic of the population. If 
there is no interaction among the elements, we are led to a birth and 
death process with $\lambda_n = n\lambda$, $\mu_n = n\mu$. The basic differential equations 
take on the form 

(5.7) $P'_0(t) = \mu P_1(t),$
      $P'_n(t) = -(\lambda + \mu)\,n\,P_n(t) + \lambda (n-1) P_{n-1}(t) + \mu (n+1) P_{n+1}(t).$

Explicit solutions can be found (footnote 22; cf. problems 9-11), but we shall 
not discuss this aspect. The limits (5.5) exist and satisfy (5.7) with 
$P'_n(t) = 0$. From the first equation we find $p_1 = 0$, and we see by 
induction from the second equation that $p_n = 0$ for all $n \ge 1$. If 
$p_0 = 1$, we may say that the probability of ultimate extinction is 1. 
If $p_0 < 1$, the relations $p_1 = p_2 = \dots = 0$ imply that with probability 
$1 - p_0$ the population increases over all bounds; ultimately the population 
must either die out or increase indefinitely. To find the probability 
$p_0$ of extinction we compare the process to the related Markov 
chain. In our case the transition probabilities (5.6) are independent 
of $n$, and we have therefore an ordinary random walk in which the 
steps to the right and left have probabilities $p = \lambda/(\lambda + \mu)$ and 
$q = \mu/(\lambda + \mu)$, respectively. The state $E_0$ (or $x = 0$) is an absorbing 
barrier. We know from the classical ruin problem (see chapter XIV, 
section 2) that the probability of extinction is 1 if $p \le q$ and $(q/p)^i$ if 
$q < p$ and $i$ is the initial state. We conclude that in our process the 
probability $p_0 = \lim P_0(t)$ of ultimate extinction is 1 if $\lambda \le \mu$, and 
$(\mu/\lambda)^i$ if $\lambda > \mu$. (This is easily verified from the explicit solution; see 
problem 10.) 

As in many similar cases, the explicit solution of (5.7) is rather complicated, 
and it is desirable to calculate the mean and the variance 
of the distribution $\{P_n(t)\}$ directly from the differential equations. We 
have for the mean 

(5.8) $M(t) = \sum_{n=1}^{\infty} n P_n(t).$

We shall omit a formal proof that M(t) is finite and that the following 
formal operations are justified (again both points follow readily from 


22 A systematic way consists in deriving a partial differential equation for the 
generating function $\sum P_n(t) s^n$. A more general process where the coefficients $\lambda$ 
and $\mu$ in (5.7) are permitted to depend on time is discussed in detail in David G. 
Kendall, The generalized “birth and death” process, Annals of Mathematical Sta- 
tistics, vol. 19 (1948), pp. 1-15. See also the same author’s Stochastic processes 
and population growth, Journal of the Royal Statistical Society, B, vol. 11 (1949), 
pp. 230-265 where the theory is generalized to take account of the age distribution 
in biological populations. 

the solution given in problem 10). Multiplying the second equation 
in (5.7) by $n$ and adding over $n = 1, 2, \dots$, we find that the terms 
containing $n^2$ cancel, and we get 

(5.9) $M'(t) = \lambda \sum (n-1) P_{n-1}(t) - \mu \sum (n+1) P_{n+1}(t) = (\lambda - \mu) M(t).$

This is a differential equation for $M(t)$. At time $t = 0$ the population 
size is $i$, and hence $M(0) = i$. Therefore 

(5.10) $M(t) = i\,e^{(\lambda - \mu)t}.$

We see that the mean tends to 0 or infinity, according as $\lambda < \mu$ or 
$\lambda > \mu$. The variance of $\{P_n(t)\}$ can be calculated in a similar way 
(cf. problem 12). 
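
The two results of the linear growth example, the extinction probability and the mean (5.10), can be checked numerically. The sketch below is illustrative only: the function names, the truncation level `n_max`, and the crude Euler integration of (5.7) are assumptions of this sketch, not part of the text.

```python
import math


def extinction_probability(lam, mu, i):
    """Probability of ultimate extinction when the initial size is i."""
    if lam <= mu:
        return 1.0
    return (mu / lam) ** i


def mean_population(lam, mu, i, t):
    """Closed form (5.10): M(t) = i * exp((lam - mu) * t)."""
    return i * math.exp((lam - mu) * t)


def mean_by_euler(lam, mu, i, t, n_max=60, steps=1000):
    """Integrate a truncated version of (5.7) by Euler steps; return the mean.

    Births out of the top state n_max are suppressed so that the total
    probability is conserved (an assumption made for the truncation).
    """
    dt = t / steps
    p = [0.0] * (n_max + 1)
    p[i] = 1.0
    for _ in range(steps):
        new = p[:]
        for n in range(n_max + 1):
            birth = lam * n if n < n_max else 0.0
            death = mu * n
            new[n] -= dt * (birth + death) * p[n]
            if n < n_max:
                new[n + 1] += dt * birth * p[n]
            if n > 0:
                new[n - 1] += dt * death * p[n]
        p = new
    return sum(n * pn for n, pn in enumerate(p))


if __name__ == "__main__":
    print(extinction_probability(2.0, 1.0, 3))
    print(mean_population(1.0, 2.0, 1, 1.0), mean_by_euler(1.0, 2.0, 1, 1.0))
```

The numerical mean agrees with (5.10) only up to the discretization error of the Euler scheme, which shrinks as `steps` grows.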

(b) Waiting lines for a single channel. In the simplest case of constant 
coefficients $\lambda_n = \lambda$, $\mu_n = \mu$ the birth and death process reduces 
to a special case of the waiting line example (7.b) when $a = 1$. 


6. EXPONENTIAL HOLDING TIMES 


The principal field of applications of the pure birth and death process 
is connected with trunking in telephone engineering and various 
types of waiting lines for telephones, counters, or machines. This type 
of problem can be treated with various degrees of mathematical so- 
phistication. The method of the birth and death process offers the 
easiest approach, but this model is based on a mathematical simplifica- 
tion known as the assumption of exponential holding times. We begin 
with a discussion of this basic assumption. 

For concreteness of language let us consider a telephone conversa- 
tion, and let us assume that its length is necessarily an integral number 
of seconds. We treat the length of the conversation as a random 
variable $X$ and assume its probability distribution $p_n = P\{X = n\}$ 
known. The telephone line then represents a physical system with 
two possible states, “busy” ($E_0$) and “free” ($E_1$). If at an arbitrary 
moment $t$ the line is busy, then the probability of a change in state 
during the next second depends on how long the conversation has been 
going on. In other words, the past has an influence on the future, 
and our process is therefore not a Markov process (see chapter XV, 
section 10). This circumstance is the source of most difficulties in 
more complicated problems. However, there exists a simple exceptional 
case discussed at length in chapter XIII, section 9. 

Imagine that the decision whether or not the conversation is to be 
continued is made each second at random by means of a skew coin. 
In other words, a sequence of Bernoulli trials with probability $p$ of success 
is performed at a rate of one per second and continued until the 
first success. The conversation ends when this first success occurs. In 
this case the total length of the conversation, the “holding time,” has 
the geometric distribution $p_n = q^{n-1}p$. If at any time $t$ the line is 
busy, the probability that it will remain busy for more than one second 
is $q$, and the probability of the transition $E_0 \to E_1$ at the next 
step is $p$. These probabilities are now independent of how long the 
line was busy. 

Without discretizing the time parameter we have to deal with con- 
tinuous random variables. The role of the geometric distribution for 
waiting times is then taken over by the exponential distribution. It is 
the only distribution having a Markovian character, that is, endowed 
with complete lack of memory. In other words, the probability that 
a conversation which goes on at time $x$ continues beyond $x + h$ is independent 
of the past duration of the conversation if, and only if, the 
probability that the conversation lasts for longer than $t$ time units is 
given by an exponential $e^{-\lambda t}$. We have encountered this “exponential 
holding time distribution” as the zero term in the Poisson distribution 
(1.1), that is, as the waiting time up to the occurrence of the first 
change. 
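
The lack of memory can be illustrated numerically for both the geometric and the exponential holding time. This is a minimal sketch; the helper names below are hypothetical and chosen only for this illustration.

```python
import math


def exponential_survival(lam, t):
    """Probability that an exponential holding time exceeds t."""
    return math.exp(-lam * t)


def geometric_survival(p, n):
    """Probability that a geometric holding time (p_k = q^(k-1) p) exceeds n seconds."""
    return (1.0 - p) ** n


def conditional_survival(survival, x, h):
    """P{length > x + h | length > x} for a given survival function."""
    return survival(x + h) / survival(x)


if __name__ == "__main__":
    lam = 0.5
    for x in (0.0, 1.0, 7.5):
        # The conditional probability does not depend on the elapsed time x.
        print(x, conditional_survival(lambda t: exponential_survival(lam, t), x, 2.0))
```

For any other survival function the printed values would in general change with `x`, which is the dependence on the past described above.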

The method of the birth and death process is applicable only if the 
transition probabilities in question do not depend on the past; for 
trunking and waiting line problems this means that all holding times 
must be exponential. From a practical point of view this assumption 
may at first sight appear rather artificial, but experience shows that it 
reasonably describes actual phenomena. In particular, many measurements 
have shown that telephone conversations within a city (footnote 13) follow 
the exponential law to a surprising degree of accuracy. The same 
situation prevails for other holding times (e.g., the duration of machine 
repairs) (footnote 14). 

It remains to characterize the so-called incoming traffic (arriving 
calls, machine breakdowns, etc.). We shall assume that during any 
time interval of length $h$ the probability of an incoming call is $\lambda h$ plus 
negligible terms, and that the probability of more than one call is in 
the limit negligible. According to the results of section 2, this means 
that the number of incoming calls has a Poisson distribution with mean 
$\lambda t$. We shall describe this situation by saying that the incoming traffic 
is of the Poisson type with intensity $\lambda$. 


13 For conversations between cities, companies usually charge by intervals of 
three minutes, and the holding times are therefore likely to be multiples of three 
minutes. This is a systematic deviation from the exponential law, and our theory 
does not apply. 

It is easy to verify the described property of exponential holding times. Denote 
by $u(t)$ the probability that a conversation lasts for at least $t$ time units. The 
probability $u(t + s)$ that a conversation starting at time 0 lasts beyond $t + s$ 
equals the probability that it lasts longer than $t$ units multiplied by the conditional 
probability that a conversation lasts additional $s$ units, given that its length exceeds 
$t$. If the past duration has no influence, the last conditional probability must 
equal $u(s)$; that is, we must have 

(6.1) $u(t + s) = u(t)\,u(s).$

It remains to prove the 


Theorem. Let $u(t)$ be defined for $t > 0$ and bounded in each finite interval. If 
$u(t)$ satisfies (6.1), then either $u(t) = 0$ for all $t$, or $u(t) = e^{-\lambda t}$ for some constant $\lambda$. 


Proof. If $u(t)$ does not vanish identically, there exists a point $x$ such that $u(x) > 0$. 
Let $\lambda = -\log u(x)$ and $v(t) = e^{\lambda t} u(xt)$. Then 

(6.2) $v(t + s) = v(t)\,v(s), \qquad v(1) = 1$

and we shall prove that $v(t) = 1$ for all $t > 0$. Clearly $v^2(\tfrac{1}{2}) = v(1) = 1$, and generally 
$v^n(1/n) = v(1) = 1$ for each integer $n > 0$. Therefore $v(1/n) = 1$ and thence 
$v(m/n) = v^m(1/n) = 1$ for each pair of integers $m > 0$, $n > 0$. Hence $v(r) = 1$ for 
each rational $r$. Suppose now that $v(\tau) = c \ne 1$. Then $v(1 - \tau) = c^{-1}$ and we may 
assume $c > 1$. In this case $v(N\tau) = v^N(\tau) = c^N$ can be made arbitrarily large by 
choosing $N$ sufficiently large. Now choose a rational $r$ in the interval 

$N\tau - 1 < r < N\tau.$

Then 

(6.3) $v(N\tau - r) = v(N\tau - r)\,v(r) = v(N\tau) = c^N$

which shows that there exist points $a = N\tau - r$ in the interval $0 < a < 1$ such 
that $v(a) = c^N$. This contradicts the assumption that $u(t)$, and therefore $v(t)$, 
are bounded in each finite interval. 


7. WAITING LINE AND SERVICING PROBLEMS 


(a) The simplest trunking problem. Suppose that infinitely many 
trunks or channels are available, and that the probability of a conversation 
ending during the interval $(t, t + h)$ is $\mu h$ plus terms which are 
negligible as $h \to 0$ (exponential holding time). The incoming calls 
constitute a traffic of the Poisson type with parameter $\lambda$. The system 
is in state $E_n$ if $n$ lines are busy. 


14 C. Palm, Intensitätsschwankungen im Fernsprechverkehr, Ericsson Technics 
(Stockholm), no. 44 (1943), pp. 1-189, in particular p. 57. Waiting line and trunking 
problems for telephone exchanges were studied long before the theory of stochastic 
processes was available and had a stimulating influence on the development 
of the theory. In particular, Palm’s impressive work over many years has proved 
useful to several authors. The earliest worker in the field was A. K. Erlang (1878-1929). 
See E. Brockmeyer, H. L. Halstrøm, and Arne Jensen, The life and works 
of A. K. Erlang, Transactions of the Danish Academy Technical Sciences, No. 2, 
Copenhagen, 1948. Independently valuable pioneer work has been done by T. C. 
Fry whose book, quoted in footnote 4 of chapter VI, did much for the development 
of engineering applications of probability. 

It is, of course, assumed that the durations of the conversations are 
mutually independent. If $n$ lines are busy, the probability that one 
of them will be freed within time $h$ is then $n\mu h + o(h)$. The probability 
that within this time two or more conversations terminate is obviously 
of the order of magnitude $h^2$ and therefore negligible. The probability 
of a new call arriving is $\lambda h + o(h)$. The probability of a combination 
of several calls, or of a call arriving and a conversation ending, is again 
$o(h)$. Thus, in the notation of section 5 

(7.1) $\lambda_n = \lambda, \qquad \mu_n = n\mu.$

The basic differential equations (5.2)-(5.3) take the form ($n \ge 1$) 

(7.2) $P'_0(t) = -\lambda P_0(t) + \mu P_1(t),$
      $P'_n(t) = -(\lambda + n\mu) P_n(t) + \lambda P_{n-1}(t) + (n+1)\mu P_{n+1}(t).$


Explicit solutions can be obtained by deriving a partial differential 
equation for the generating function (cf. problem 13). We shall only 
determine the quantities $p_n = \lim P_n(t)$ of (5.5). They satisfy the 
equations 

(7.3) $\lambda p_0 = \mu p_1,$
      $(\lambda + n\mu) p_n = \lambda p_{n-1} + (n+1)\mu p_{n+1}.$

We find by induction that $p_n = p_0 (\lambda/\mu)^n / n!$, and hence 

(7.4) $p_n = e^{-\lambda/\mu}\,\frac{(\lambda/\mu)^n}{n!}.$

Thus, the limiting distribution is a Poisson distribution with parameter 
$\lambda/\mu$. It is independent of the initial state. 
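
The limiting distribution can also be obtained numerically for any birth and death process whose limits exist. The sketch below assumes the recursion $\lambda_n p_n = \mu_{n+1} p_{n+1}$, which is the relation that the induction from the steady state equations yields in this example; the function names and the truncation level are assumptions of the sketch.

```python
import math


def stationary_distribution(birth, death, n_max):
    """Solve p_{n+1} = p_n * birth(n) / death(n + 1) and normalize over 0..n_max.

    birth(n) plays the role of lambda_n and death(n) of mu_n.
    """
    weights = [1.0]
    for n in range(n_max):
        weights.append(weights[-1] * birth(n) / death(n + 1))
    total = sum(weights)
    return [w / total for w in weights]


def poisson_pmf(mean, n):
    """Poisson probability with the given mean."""
    return math.exp(-mean) * mean ** n / math.factorial(n)


if __name__ == "__main__":
    lam, mu = 2.0, 1.0
    p = stationary_distribution(lambda n: lam, lambda n: n * mu, 60)
    for n in range(5):
        print(n, p[n], poisson_pmf(lam / mu, n))
```

With the coefficients (7.1) the computed values coincide, up to the negligible truncated tail, with the Poisson probabilities.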


It is easy to find the mean $M(t) = \sum n P_n(t)$. Multiplying the nth equation of 
(7.2) by $n$ and adding, we get, taking into account that the $P_n(t)$ add to unity, 

(7.5) $M'(t) = \lambda - \mu M(t).$

When the initial state is $E_i$, then $M(0) = i$, and 

(7.6) $M(t) = \frac{\lambda}{\mu}\left(1 - e^{-\mu t}\right) + i\,e^{-\mu t}.$

As $t \to \infty$, we see that $M(t)$ approaches the mean of the Poisson distribution found 
above. Incidentally, the reader may verify that in the special case $i = 0$ the 
$P_n(t)$ are given exactly by a Poisson distribution with mean $M(t)$. 

(b) Waiting lines for a finite number of channels (footnote 15). We now modify 
the last example to obtain a more realistic model. The assumptions 
are the same, except that the number $a$ of trunklines or channels is finite. 
If all channels are busy, each new call joins a waiting line and waits until 
a channel is freed. This means that all trunklines have a common waiting 
line. 

The word “trunk” may be replaced by counter at a postoffice and 
“conversation” by service. We are actually treating the general waiting 
line problem for the case where a person has to wait only if all a chan- 
nels are busy. 

We say that the system is in state $E_n$ if there are exactly $n$ persons either 
being served or in the waiting line. Such a line exists only when $n > a$, 
and then there are $n - a$ persons in it. 

As long as at least one channel is free, we are in exactly the same 
situation as in the preceding example. However, if the system is in a 
state $E_n$ with $n \ge a$, only $a$ conversations are going on, and we have 
therefore $\mu_n = a\mu$ for $n \ge a$. The basic system of differential equations 
is therefore given by (7.2) for $n < a$, but for $n \ge a$ by 

(7.7) $P'_n(t) = -(\lambda + a\mu) P_n(t) + \lambda P_{n-1}(t) + a\mu P_{n+1}(t).$


In the special case of a single channel (a = 1) these equations reduce 
to those of a birth and death process with coefficients independent of n. 
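The limits $p_n$ of (5.5) exist; they satisfy (7.3) for $n < a$, and 

(7.8) $(\lambda + a\mu) p_n = \lambda p_{n-1} + a\mu p_{n+1}$

for $n \ge a$. By recursion we find again that 

(7.9) $p_n = p_0\,\frac{(\lambda/\mu)^n}{n!}, \qquad n \le a,$

(7.10) $p_n = \frac{(\lambda/\mu)^n}{a!\,a^{n-a}}\,p_0, \qquad n \ge a.$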


The series $\sum (p_n/p_0)$ converges only if 

(7.11) $\frac{\lambda}{\mu} < a.$

Hence, if (7.11) does not hold, a limiting distribution $\{p_k\}$ cannot exist. 
In this case $p_n = 0$ for all $n$, which means that gradually the waiting line 
grows over all bounds. On the other hand, if (7.11) holds, then we can 
determine $p_0$ so that the sum of the expressions (7.9) and (7.10) equals 
unity. From the explicit expressions for $P_n(t)$, which we have not derived, 
however, it can be shown that the $p_n$ thus obtained really represent 
the limiting distribution of the $P_n(t)$. Table 1 gives a numerical 
illustration for $a = 3$, $\lambda/\mu = 2$. 


15 A. Kolmogoroff, Sur le problème d’attente, Recueil Mathématique [Sbornik], 
Vol. 38, 1931, pp. 101-106. 


TABLE 1 


LIMITING PROBABILITIES IN THE CASE OF a = 3 CHANNELS AND λ/μ = 2 


n                   0       1       2       3       4       5       6       7 
Lines busy          0       1       2       3       3       3       3       3 
People waiting      0       0       0       0       1       2       3       4 
p_n            0.1111  0.2222  0.2222  0.1481  0.0988  0.0658  0.0439  0.0293 
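
The entries of table 1 follow directly from (7.9) and (7.10) once $p_0$ is fixed by the normalization. A minimal sketch, assuming the geometric tail of (7.10) is summed in closed form; the function name is hypothetical.

```python
import math


def waiting_line_distribution(a, rho, n_values):
    """Limiting probabilities (7.9)-(7.10) for a channels and rho = lambda/mu.

    Requires rho < a, i.e. condition (7.11); otherwise no limiting
    distribution exists.
    """
    if rho >= a:
        raise ValueError("condition (7.11) violated: need lambda/mu < a")
    # Sum of p_n / p_0 for n < a, plus the geometric tail for n >= a.
    head = sum(rho ** n / math.factorial(n) for n in range(a))
    tail = rho ** a / math.factorial(a) / (1.0 - rho / a)
    p0 = 1.0 / (head + tail)
    result = []
    for n in n_values:
        if n <= a:
            result.append(p0 * rho ** n / math.factorial(n))
        else:
            result.append(p0 * rho ** n / (math.factorial(a) * a ** (n - a)))
    return result


if __name__ == "__main__":
    for n, p in enumerate(waiting_line_distribution(3, 2.0, range(8))):
        print(n, round(p, 4))
```

For $a = 3$ and $\lambda/\mu = 2$ the normalizing sum equals 9, so that $p_0 = 1/9$.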


(c) Servicing of machines. The results derived in this and the next 
example are being successfully applied in Swedish industry. For orien- 
tation we begin with the simplest case and generalize it in the next 
example. The problem is as follows. 

We consider automatic machines which normally require no human 
care. However, at any time a machine may break down and call for 
service. The time required for servicing the machine is again taken 
as a random variable with an exponential distribution. In other words, 
the machine is characterized by two constants $\lambda$ and $\mu$ with the following 
properties. If at time $t$ the machine is in working state, the probability 
that it will call for service before time $t + h$ is $\lambda h$ plus terms 
which are negligible in the limit $h \to 0$. Conversely, if at time $t$ the 
machine is being serviced, the probability that the servicing time terminates 
before $t + h$ and the machine reverts to the working state is 
$\mu h + o(h)$. For an efficient machine $\lambda$ should be relatively small and 
$\mu$ relatively large. The ratio $\lambda/\mu$ is called the servicing factor. 

We suppose that $m$ machines with the same parameters $\lambda$ and $\mu$ and 
working independently are serviced by a single repairman. A machine 
which breaks down is serviced immediately unless the repairman is 
servicing another machine, in which case a waiting line is formed. We 
say that the system is in state $E_n$ if $n$ machines are not working. For 
$1 \le n \le m$ this means that one machine is being serviced and $n - 1$ 
are in the waiting line; in the state $E_0$ all machines work and the 
repairman is idle. 


16 Examples (c) and (d), including the numerical illustrations, are taken from an 
article by C. Palm, The distribution of repairmen in servicing automatic machines 
(in Swedish), Industritidningen Norden, vol. 75 (1947), pp. 75-80, 90-94, 119-123. 
Palm gives tables and graphs for the most economical number of repairmen. 

A transition $E_n \to E_{n+1}$ is caused by a breakdown of one among the 
$m - n$ working machines, whereas a transition $E_n \to E_{n-1}$ occurs if 
the machine being serviced reverts to the working state. Hence we 
have a birth and death process with coefficients 

(7.12) $\lambda_n = (m - n)\lambda, \qquad \mu_0 = 0, \qquad \mu_1 = \mu_2 = \dots = \mu_m = \mu$

and the basic differential equations (5.2) and (5.3) become ($1 \le n \le m - 1$): 

(7.13) $P'_0(t) = -m\lambda P_0(t) + \mu P_1(t),$
       $P'_n(t) = -\{(m - n)\lambda + \mu\} P_n(t) + (m - n + 1)\lambda P_{n-1}(t) + \mu P_{n+1}(t),$
       $P'_m(t) = -\mu P_m(t) + \lambda P_{m-1}(t).$


This is a finite system of differential equations and can be solved by 
ordinary methods. The limits (5.5) exist and satisfy the equations 


(7.14) $m\lambda p_0 = \mu p_1,$
       $\{(m - n)\lambda + \mu\} p_n = (m - n + 1)\lambda p_{n-1} + \mu p_{n+1},$
       $\mu p_m = \lambda p_{m-1}.$

It follows easily that the recursion formula 

(7.15) $(m - n)\lambda p_n = \mu p_{n+1}$

holds. Substituting successively $n = m-1, m-2, \dots, 1, 0$, we get 

(7.16) $p_{m-k} = \frac{1}{k!}\left(\frac{\mu}{\lambda}\right)^k p_m.$

The remaining unknown constant $p_m$ can be obtained from the condition 
that the $p_j$ add to unity: 

$p_m = \left[\sum_{k=0}^{m} \frac{1}{k!}\left(\frac{\mu}{\lambda}\right)^k\right]^{-1}.$

Formula (7.16) is well known among trunking engineers as Erlang’s 
loss formula. Typical numerical values are exhibited in table 2. 
TABLE 2 


PROBABILITIES p_n FOR THE CASE λ/μ = 0.1, m = 6 
(ERLANG’S LOSS FORMULA) 


n     Machines in Waiting Line     p_n 

0              0                 0.4845 
1              0                 0.2907 
2              1                 0.1454 
3              2                 0.0582 
4              3                 0.0175 
5              4                 0.0035 
6              5                 0.0003 


The probability $p_0$ may be interpreted as the probability of the repairman’s 
being idle (in the example of table 2 he should be idle about 
half the time). The expected number of machines in the waiting line is 

(7.17) $w = \sum_{k=1}^{m} (k - 1) p_k = \sum_{k=1}^{m} k p_k - (1 - p_0).$

This quantity can be calculated by adding the relations (7.15) for 
$n = 0, 1, \dots, m$. Using the fact that the $p_n$ add to unity, we get 

$m\lambda - \lambda w - \lambda(1 - p_0) = \mu(1 - p_0)$

or 

(7.18) $w = m - \frac{\lambda + \mu}{\lambda}\,(1 - p_0).$

In the example of table 2 we have $w = 6 \cdot (0.0549)$. Thus 0.0549 is the 
average contribution of a machine to the waiting line. 
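
Formulas (7.16) to (7.18) are easy to evaluate numerically. The following sketch uses hypothetical function names and simply reproduces the leading entry of table 2 and the quantity $w/m$ quoted above.

```python
import math


def single_repairman(m, lam, mu):
    """Limiting probabilities p_0, ..., p_m from (7.16) and the normalization."""
    ratio = mu / lam
    # weights[k] is p_{m-k} / p_m according to (7.16).
    weights = [ratio ** k / math.factorial(k) for k in range(m + 1)]
    p_m = 1.0 / sum(weights)
    p = [0.0] * (m + 1)
    for k, w in enumerate(weights):
        p[m - k] = w * p_m
    return p


def expected_waiting(m, lam, mu, p0):
    """Expected number of machines in the waiting line, formula (7.18)."""
    return m - (lam + mu) / lam * (1.0 - p0)


if __name__ == "__main__":
    p = single_repairman(6, 0.1, 1.0)
    w = expected_waiting(6, 0.1, 1.0, p[0])
    print(round(p[0], 4), round(w / 6, 4))
```

The value of `w` can also be computed directly from the definition (7.17), which gives an independent check of (7.18).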

(d) Continuation: several repairmen. We shall not change the basic 
assumptions of the preceding problem, except that the $m$ machines are 
now serviced by $r$ repairmen ($r < m$). Thus for $n \le r$ the state $E_n$ 
means that $r - n$ repairmen are idle, $n$ machines are being serviced, 
and no machine is in the waiting line for repairs. For $n > r$ the state 
$E_n$ signifies that $r$ machines are being serviced and $n - r$ machines are 
in the waiting line. We can use the setup of the preceding example 
except that (7.12) is obviously to be replaced by 

(7.19) $\lambda_0 = m\lambda, \qquad \mu_0 = 0,$
       $\lambda_n = (m - n)\lambda, \qquad \mu_n = n\mu \qquad (1 \le n \le r),$
       $\lambda_n = (m - n)\lambda, \qquad \mu_n = r\mu \qquad (r \le n \le m).$


We shall not write down the basic system of differential equations but 
only the equations for the limiting probabilities $p_n$. They are 

(7.20) $m\lambda p_0 = \mu p_1,$
       $\{(m - n)\lambda + n\mu\} p_n = (m - n + 1)\lambda p_{n-1} + (n + 1)\mu p_{n+1} \qquad (1 \le n < r),$
       $\{(m - n)\lambda + r\mu\} p_n = (m - n + 1)\lambda p_{n-1} + r\mu p_{n+1} \qquad (r \le n \le m).$

From the first equation we get the ratio $p_1/p_0$. From the second 
equation we get by induction for $n < r$ 

(7.21) $(n + 1)\mu p_{n+1} = (m - n)\lambda p_n;$

finally, for $n \ge r$ we get from the last equation in (7.20) 

(7.22) $r\mu p_{n+1} = (m - n)\lambda p_n.$

These equations permit calculating successively the ratios $p_n/p_0$. 
Finally, $p_0$ follows from the condition $\sum p_k = 1$. The values in table 3 
are obtained in this way. 


TABLE 3 


PROBABILITIES p_n FOR THE CASE λ/μ = 0.1, m = 20, r = 3 


 n    Machines    Machines    Repairmen      p_n 
      Serviced    Waiting     Idle 

 0       0           0           3         0.13625 
 1       1           0           2         0.27250 
 2       2           0           1         0.25888 
 3       3           0           0         0.15533 
 4       3           1           0         0.08802 
 5       3           2           0         0.04694 
 6       3           3           0         0.02347 
 7       3           4           0         0.01095 
 8       3           5           0         0.00475 
 9       3           6           0         0.00190 
10       3           7           0         0.00070 
11       3           8           0         0.00023 
12       3           9           0         0.00007 


A comparison of tables 2 and 3 reveals surprising facts. Note that 
both tables refer to the same machines ($\lambda/\mu = 0.1$), but in the second 
case we have $m = 20$ machines and $r = 3$ repairmen. The number of 
machines per repairman has increased from 6 to 6 2/3, but at the same 
time, the machines are serviced more efficiently. Let us define a 
coefficient of loss for machines by 

(7.23) $\frac{w}{m} = \frac{\text{average number of machines in waiting line}}{\text{number of machines}}$

and a coefficient of loss for repairmen by 

(7.24) $\frac{\rho}{r} = \frac{\text{average number of repairmen idle}}{\text{number of repairmen}}.$

For practical purposes we may identify the probabilities $P_n(t)$ with 
their limits $p_n$. In table 3 we have then $w = p_4 + 2p_5 + 3p_6 + \dots + 17p_{20}$ 
and $\rho = 3p_0 + 2p_1 + p_2$. Table 4 proves conclusively that 
for our particular machines (for which $\lambda/\mu = 0.1$) three repairmen per 
twenty machines are ever so much more economical than one repairman 
per six machines. Palm’s tables referred to in footnote 16 enable us to 
find the most economical ratio of repairmen per machine. 


TABLE 4 


COMPARISON OF EFFICIENCIES OF TWO SYSTEMS DISCUSSED IN 
EXAMPLES (c) AND (d) 


                                        I         II 
Number of machines                      6         20 
Number of repairmen                     1          3 
Machines per repairman                  6       6 2/3 
Coefficient of loss for repairmen    0.4845     0.4042 
Coefficient of loss for machines     0.0549     0.01694 
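
The comparison of the two systems can be reproduced from the recursions (7.21) and (7.22). The sketch below is illustrative; the function names are assumptions, and the coefficients are those defined in (7.23) and (7.24).

```python
def repair_distribution(m, r, lam, mu):
    """Limiting probabilities p_0, ..., p_m for m machines and r repairmen.

    Uses (7.21) for n < r and (7.22) for n >= r:
    min(n + 1, r) * mu * p_{n+1} = (m - n) * lam * p_n.
    """
    weights = [1.0]
    for n in range(m):
        in_service = min(n + 1, r)  # machines being serviced in state n + 1
        weights.append(weights[-1] * (m - n) * lam / (in_service * mu))
    total = sum(weights)
    return [w / total for w in weights]


def loss_coefficients(m, r, lam, mu):
    """Return (w / m, rho / r), the coefficients of loss (7.23) and (7.24)."""
    p = repair_distribution(m, r, lam, mu)
    w = sum((n - r) * p[n] for n in range(r + 1, m + 1))  # machines waiting
    rho = sum((r - n) * p[n] for n in range(r))           # repairmen idle
    return w / m, rho / r


if __name__ == "__main__":
    print(loss_coefficients(6, 1, 0.1, 1.0))
    print(loss_coefficients(20, 3, 0.1, 1.0))
```

Only the ratio $\lambda/\mu$ enters the limiting probabilities, so the particular values `lam = 0.1`, `mu = 1.0` are an arbitrary choice with the ratio 0.1 of the tables.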

(e) A power-supply problem (footnote 17). One electric circuit supplies $a$ welders 
who use the current only intermittently. If at time $t$ a welder uses current, 
the probability that he ceases using it at time $t + h$ is $\mu h + o(h)$; 
if at time $t$ he requires no current, the probability that he calls for current 
before $t + h$ is $\lambda h + o(h)$. The welders work independently of 
each other. 


17 This example was suggested by the problem treated (inadequately) by H. A. 
Adler and K. W. Miller, A new approach to probability problems in electrical 
engineering, Transactions of the American Institute of Electrical Engineers, vol. 65 
(1946), pp. 630-632. 

We say that the system is in state $E_n$ if $n$ welders are using current. 
Thus we have only finitely many states $E_0, \dots, E_a$. 

If the system is in state $E_n$, then $a - n$ welders are not using current 
and the probability for a new call for current within time $h$ is 
$(a - n)\lambda h + o(h)$; on the other hand, the probability that one of the $n$ 
welders ceases using current is $n\mu h + o(h)$. Hence we have a birth 
and death process with 

(7.25) $\lambda_n = (a - n)\lambda, \qquad \mu_n = n\mu, \qquad 0 \le n \le a.$

The basic differential equations become ($1 \le n \le a - 1$) 

(7.26) $P'_0(t) = -a\lambda P_0(t) + \mu P_1(t),$
       $P'_n(t) = -\{n\mu + (a - n)\lambda\} P_n(t) + (n + 1)\mu P_{n+1}(t) + (a - n + 1)\lambda P_{n-1}(t),$
       $P'_a(t) = -a\mu P_a(t) + \lambda P_{a-1}(t).$


It is easily verified that the limiting probabilities are given by the 
binomial distribution 

(7.27) $p_n = \binom{a}{n} \left(\frac{\lambda}{\lambda + \mu}\right)^n \left(\frac{\mu}{\lambda + \mu}\right)^{a-n},$


a result which could have been anticipated on intuitive grounds. 
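
The verification can be carried out numerically by substituting the binomial probabilities into the right sides of (7.26), which must then vanish. A small sketch under that reading; the function names are hypothetical.

```python
import math


def binomial_pmf(a, n, p):
    """Binomial probability of n successes in a trials."""
    return math.comb(a, n) * p ** n * (1.0 - p) ** (a - n)


def stationary_residuals(p, a, lam, mu):
    """Right sides of (7.26) evaluated at a candidate limiting distribution p."""
    residuals = []
    for n in range(a + 1):
        value = -(n * mu + (a - n) * lam) * p[n]
        if n < a:
            value += (n + 1) * mu * p[n + 1]
        if n > 0:
            value += (a - n + 1) * lam * p[n - 1]
        residuals.append(value)
    return residuals


if __name__ == "__main__":
    a, lam, mu = 5, 1.0, 3.0
    p = [binomial_pmf(a, n, lam / (lam + mu)) for n in range(a + 1)]
    print(max(abs(x) for x in stationary_residuals(p, a, lam, mu)))
```

The intuitive reason is that each welder independently uses current a fraction $\lambda/(\lambda + \mu)$ of the time.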


8. THE BACKWARD (RETROSPECTIVE) EQUATIONS 


In the preceding sections we were studying the probabilities $P_n(t)$ 
of finding the system at time $t$ in state $E_n$. This notation is convenient 
but misleading, inasmuch as it omits mentioning the initial state $E_i$ 
of the system at time zero. For theoretical purposes it is therefore 
more natural to introduce the notation $P_{in}(t)$; this is the probability 
that the system is at time $t$ in state $E_n$, given that at time zero it was in $E_i$. 
The $P_{in}(t)$ will be called transition probabilities. 

It must be emphasized that we have been studying these transition 
probabilities all along and that nothing is changed but notation. When 
the initial state is known to be $E_i$, then $\{P_{in}(t)\}$ is the absolute probability 
distribution at time $t$. When at time zero we have only a probability 
distribution $\{q_i\}$ for the initial state, then the probability of $E_n$ 
at time $t$ is 

(8.1) $Q_n(t) = \sum_i q_i P_{in}(t).$


In the case of the pure birth process and of the birth and death process, 
we found that for an arbitrary fixed $i$ the transition probabilities 
$P_{in}(t)$ satisfy the basic differential equations (3.2) and (5.2). The subscript 
$i$ appears only in the initial conditions, which should now be 
written 

(8.2) $P_{in}(0) = \begin{cases} 1 & \text{for } n = i, \\ 0 & \text{otherwise.} \end{cases}$

These basic differential equations were derived by prolonging the 
time interval $(0, t)$ to $(0, t + h)$ and considering the possible changes 
during the short time $(t, t + h)$. We could as well have prolonged the 
interval $(0, t)$ in the direction of the past and considered the changes 
during $(-h, 0)$. In this way we get a new system of differential equations 
in which $n$ (instead of $i$) remains fixed. 

Consider first the case of a pure birth process and let us neglect 
events whose probability tends to zero faster than $h$. If the system 
passed from $E_i$ ($i \ge 0$) at time $-h$ to $E_n$ at time $t$, then at time 0 it 
was with probability $1 - o(h)$ either at $E_i$ or at $E_{i+1}$. By the method 
of sections 2 and 3 we conclude that 

(8.3) $P_{in}(t + h) = P_{in}(t)(1 - \lambda_i h) + P_{i+1,n}(t)\,\lambda_i h + o(h).$

Hence for $i \ge 0$ the new basic system now takes the form 

(8.4) $P'_{in}(t) = -\lambda_i P_{in}(t) + \lambda_i P_{i+1,n}(t).$

These equations are called the backward equations, and, for distinction, 
equations (3.2) are called the forward equations. The initial conditions 
are (8.2). (Intuitively one should expect that 

(8.5) $P_{in}(t) = 0$ if $n < i$, 

but pathological exceptions exist; see section 10). 

In the case of the birth and death process, if the system is at time 
$-h$ in $E_i$, then at time zero it should be in $E_{i+1}$, $E_i$, or $E_{i-1}$, and the 
same argument leads to the backward equations 

(8.6) $P'_{in}(t) = -(\lambda_i + \mu_i) P_{in}(t) + \lambda_i P_{i+1,n}(t) + \mu_i P_{i-1,n}(t).$


These equations correspond to (5.2). 

It should be clear that the forward and backward equations are not 
independent of each other; the solution of the backward equations with 
the initial conditions (8.2) automatically satisfies the forward equa- 
tions, except in the rare situations where the solution is not unique. 
These connections are mentioned here only as a preparation for the 
general theory of the next section. 

Example. The Poisson process. In section 2 we have interpreted 
the Poisson expression (1.1) as the probability that exactly $n$ calls 
arrive during any time interval of length $t$. Let us say that at time $t$ 
the system is in state $E_n$ if exactly $n$ calls arrive within the time interval 
from 0 to $t$. A transition from $E_i$ at $t_1$ to $E_n$ at $t_2$ means that $n - i$ 
calls arrived during $(t_1, t_2)$. This is possible only if $n \ge i$, and hence 
we have for the transition probabilities of the Poisson process 

(8.7) $P_{in}(t) = e^{-\lambda t}\,\frac{(\lambda t)^{n-i}}{(n - i)!}$ if $n \ge i$, 
      $P_{in}(t) = 0$ if $n < i$. 

The forward and backward equations are, respectively, 

(8.8) $P'_{in}(t) = -\lambda P_{in}(t) + \lambda P_{i,n-1}(t)$

and 

(8.9) $P'_{in}(t) = -\lambda P_{in}(t) + \lambda P_{i+1,n}(t);$

and it is easily verified that (8.7) is a solution of both systems and 
satisfies the initial condition (8.2). 
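
The verification can also be done numerically by comparing a finite-difference derivative of (8.7) with the right sides of (8.8) and (8.9). This is only a sketch: the central difference and its step size are assumptions of the illustration, and the function names are hypothetical.

```python
import math


def poisson_transition(i, n, t, lam):
    """Transition probability (8.7) of the Poisson process."""
    if n < i:
        return 0.0
    k = n - i
    return math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)


def derivative(i, n, t, lam, h=1e-5):
    """Central-difference approximation to the derivative of P_in(t)."""
    upper = poisson_transition(i, n, t + h, lam)
    lower = poisson_transition(i, n, t - h, lam)
    return (upper - lower) / (2.0 * h)


def forward_rhs(i, n, t, lam):
    """Right side of the forward equation (8.8)."""
    return -lam * poisson_transition(i, n, t, lam) + lam * poisson_transition(i, n - 1, t, lam)


def backward_rhs(i, n, t, lam):
    """Right side of the backward equation (8.9)."""
    return -lam * poisson_transition(i, n, t, lam) + lam * poisson_transition(i + 1, n, t, lam)


if __name__ == "__main__":
    i, n, t, lam = 2, 5, 1.3, 0.7
    print(derivative(i, n, t, lam), forward_rhs(i, n, t, lam), backward_rhs(i, n, t, lam))
```

For the Poisson process the two right sides coincide, because $P_{in}(t)$ depends only on the difference $n - i$.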


9. GENERALIZATION; THE KOLMOGOROV EQUATIONS 


So far the theory has been restricted to processes in which direct 
transitions from a state $E_n$ are possible only to the neighboring states 
$E_{n+1}$ and $E_{n-1}$. Moreover, the processes have been time-homogeneous, 
that is to say, the transition probabilities $P_{in}(t)$ have been the same 
for all time intervals of length $t$. We now consider more general processes 
in which both assumptions are dropped. 

As in the theory of ordinary Markov chains, we shall permit direct 
transitions from any state $E_i$ to any state $E_n$. The transition probabilities 
are permitted to vary in time. This necessitates specifying 
the two endpoints of any time interval instead of specifying just its 
length. Accordingly, we shall write $P_{in}(\tau, t)$ for the conditional probability 
of finding the system at time $t$ in state $E_n$, given that at a previous 
instant $\tau$ the state was $E_i$. The symbol $P_{in}(\tau, t)$ is meaningless unless 
$\tau < t$. If the process is homogeneous in time, then $P_{in}(\tau, t)$ depends only 
on the difference $t - \tau$, and we can write $P_{in}(t)$ instead of $P_{in}(\tau, \tau + t)$ 
(which is then independent of $\tau$). 

The principal property of our processes is the Markov property discussed 
in chapter XV, section 10: Given the state of the system at any 
time, future changes are independent of the past. More precisely, 
consider three moments $\tau < s < t$ and suppose that at time $\tau$ the system 
is in state $E_i$ and at time $s$ in state $E_\nu$. For an arbitrary process 
the (conditional) probability of finding the system at time $t$ in state 
$E_n$ depends on both $i$ and $\nu$; in other words, not only the “present 
state” $E_\nu$, but also the past state $E_i$, has an influence on the state at 
time $t$. However, for a Markov process this is not so. For it the 
considered probability equals $P_{\nu n}(s, t)$, the probability of a transition 
from $E_\nu$ at time $s$ to $E_n$ at time $t$; the knowledge that at time $\tau < s$ 
the system was in state $E_i$ permits no inference about the future. 
This assumption leads directly to an important conclusion. The passage 
from $E_i$ at time $\tau$ to $E_n$ at time $t$ must occur via some state $E_\nu$ at 
time $s$, and for a Markov process the probability that the passage goes 
via a particular state $E_\nu$ is $P_{i\nu}(\tau, s) P_{\nu n}(s, t)$. It follows that we must 
have 

(9.1) $P_{in}(\tau, t) = \sum_\nu P_{i\nu}(\tau, s)\,P_{\nu n}(s, t)$

identically for all $\tau < s < t$. This is the Chapman-Kolmogorov equation. 
It is the counterpart, for the case of a continuous time parameter, to equation 
XV(10.3), which is valid when the time parameter assumes integral 
values only. 


It was shown in chapter XV, section 10, that the Chapman-Kolmogorov 
equation does not hold for all stochastic processes. For our 
purposes we could take (9.1) as defining the class of processes with which 
we are concerned (footnote 18). In fact, we shall add only regularity restrictions 
and derive our basic differential equations from (9.1). There is a probabilistic 
background leading up to the Chapman-Kolmogorov equation, 
but we need not refer to it; once (9.1) is given we can easily derive 
differential equations which determine the probabilities $P_{in}(t)$ and can 
proceed in a purely analytical way. 

In the case of time-homogeneous processes, equation (9.1) assumes 
the simpler form 

(9.2) $P_{in}(t + s) = \sum_\nu P_{i\nu}(t)\,P_{\nu n}(s).$

For the Poisson process this equation reduces to the convolution property 
of the Poisson distribution [example XI(2.c)]. 
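
For the Poisson process the identity (9.2) can be checked term by term, since only the intermediate states between $E_i$ and $E_n$ contribute to the sum. A minimal sketch with hypothetical function names:

```python
import math


def poisson_transition(i, n, t, lam):
    """Transition probability of the Poisson process from E_i to E_n in time t."""
    if n < i:
        return 0.0
    k = n - i
    return math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)


def chapman_kolmogorov_sum(i, n, t, s, lam):
    """Right side of (9.2); terms with nu < i or nu > n vanish."""
    return sum(
        poisson_transition(i, nu, t, lam) * poisson_transition(nu, n, s, lam)
        for nu in range(i, n + 1)
    )


if __name__ == "__main__":
    i, n, t, s, lam = 1, 6, 0.8, 1.7, 1.2
    print(chapman_kolmogorov_sum(i, n, t, s, lam), poisson_transition(i, n, t + s, lam))
```

The agreement of the two printed numbers is the convolution property: the numbers of calls in two adjacent intervals add up to a Poisson variable with mean $\lambda(t + s)$.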


18 The question of whether the Kolmogorov equation characterizes Markov
processes poses difficult problems requiring the study of the actual
sample functions X(t). It should be borne in mind that we are using a
short cut to obtain differential equations for certain probabilities
and are not analyzing the process in all its aspects.

We now introduce our fundamental regularity conditions which in
an obvious way generalize the starting assumptions of the birth and
death process.


Assumption 1. To every state $E_n$ there corresponds a continuous
function $c_n(t) \ge 0$ such that as $h \to 0$


(9.3)    $\frac{1 - P_{nn}(t, t+h)}{h} \to c_n(t)$.


The probabilistic interpretation of (9.3) is obvious; if at time $t$ the
system is in state $E_n$, the probability that during $(t, t+h)$ a change
occurs is $c_n(t)h + o(h)$. Analytically, relations (9.3) require that
$P_{nn}(t, s) \to 1$ as $s \to t$, and that $P_{nn}(t, x)$ has at $x = t$
a derivative. The function $c_n(t)$ plays the role of
$\lambda_n + \mu_n$ in the birth and death process. In the case of a
time-homogeneous process, $c_n$ is independent of $t$.


Assumption 2. To every pair of states $E_j$, $E_k$ with $j \ne k$ there
correspond transition probabilities $p_{jk}(t)$ (depending on time) such
that as $h \to 0$


(9.4)    $\frac{P_{jk}(t, t+h)}{h} \to c_j(t)\, p_{jk}(t)$    $(j \ne k)$.


The $p_{jk}(t)$ are continuous in $t$, and for every fixed $t$, $j$

(9.5)    $\sum_{k} p_{jk}(t) = 1, \qquad p_{jj}(t) = 0$.

Here $p_{jk}(t)$ can be interpreted as the conditional probability that,
if a change from $E_j$ occurs during $(t, t+h)$, this change takes the
system from $E_j$ to $E_k$. In the birth and death process


(9.6)    $p_{j,j+1}(t) = \frac{\lambda_j}{\lambda_j + \mu_j}, \qquad p_{j,j-1}(t) = \frac{\mu_j}{\lambda_j + \mu_j}$


and $p_{jk}(t) = 0$ for all other combinations of $j$ and $k$. For every
fixed $t$ the $p_{jk}(t)$ can be interpreted as transition probabilities
of a Markov chain.

The two assumptions suffice to derive a system of backward equations
for the $P_{jk}(\tau, t)$, but for the forward equations we require in
addition


Assumption 3. For fixed $k$ the passage to the limit in (9.4) is uniform
with respect to $j$.

The necessity of this assumption is of considerable interest for the
theory of infinite systems of differential equations and will be
discussed in the next section.

We proceed to derive differential equations for the $P_{ik}(\tau, t)$ as
functions of $t$ and $k$ (forward equations). From equation (9.1) we have


(9.7)    $P_{ik}(\tau, t+h) = \sum_{j} P_{ij}(\tau, t)\, P_{jk}(t, t+h)$.


Expressing the term $P_{kk}(t, t+h)$ on the right in accordance with
(9.3), we get


(9.8)    $\frac{P_{ik}(\tau, t+h) - P_{ik}(\tau, t)}{h} = -c_k(t)\, P_{ik}(\tau, t) + \frac{1}{h} \sum_{j \ne k} P_{ij}(\tau, t)\, P_{jk}(t, t+h) + \cdots$


where the neglected terms tend to 0 with $h$, and the sum extends over
all $j$ except $j = k$. We can now apply (9.4) to the terms of the sum.
Since (by assumption 3) the passage to the limit is uniform in $j$, the
right side has a limit. Hence also the left side has a limit, which means
that $P_{ik}(\tau, t)$ has a partial derivative with respect to $t$, and

(9.9)    $\frac{\partial P_{ik}(\tau, t)}{\partial t} = -c_k(t)\, P_{ik}(\tau, t) + \sum_{j} P_{ij}(\tau, t)\, c_j(t)\, p_{jk}(t)$.

This is the basic system of forward differential equations. Note that $i$
and $\tau$ are fixed so that we have (despite the formal appearance of a
partial derivative) a system of ordinary differential equations for the
infinite system of functions $P_{ik}(\tau, t)$, $k = 0, 1, 2, \ldots$. The
parameters $i$ and $\tau$ appear only in the initial condition


(9.10)    $P_{ik}(\tau, \tau) = 1$ for $k = i$, and $P_{ik}(\tau, \tau) = 0$ otherwise.


A system of backward equations can be obtained on similar lines,
and the derivation is actually simpler since we can dispense with
assumption 3 entirely. As for equations (9.3) and (9.4), it is more
natural to use the forms


(9.3a)    $\frac{1 - P_{nn}(t-h, t)}{h} \to c_n(t)$

(9.4a)    $\frac{P_{jk}(t-h, t)}{h} \to c_j(t)\, p_{jk}(t)$    $(j \ne k)$.


These relations can be shown to be equivalent to (9.3) and (9.4), but
we shall simply start from (9.3a), (9.4a), and (9.5) as our basic
assumptions. Rewriting the Chapman-Kolmogorov equation (9.1) in the form


(9.11)    $P_{ik}(\tau - h, t) = \sum_{\nu} P_{i\nu}(\tau - h, \tau)\, P_{\nu k}(\tau, t)$


and using (9.3a) with $n = i$, we get


(9.12)    $\frac{P_{ik}(\tau - h, t) - P_{ik}(\tau, t)}{h} = -c_i(\tau)\, P_{ik}(\tau, t) + \frac{1}{h} \sum_{\nu \ne i} P_{i\nu}(\tau - h, \tau)\, P_{\nu k}(\tau, t) + \cdots$


Here $h^{-1} P_{i\nu}(\tau - h, \tau) \to c_i(\tau)\, p_{i\nu}(\tau)$ and the
passage to the limit in the sum to the right in (9.12) is always uniform.
In fact, if $N > i$ we have


(9.13)    $0 \le h^{-1} \sum_{\nu = N+1}^{\infty} P_{i\nu}(\tau - h, \tau)\, P_{\nu k}(\tau, t) \le h^{-1} \sum_{\nu = N+1}^{\infty} P_{i\nu}(\tau - h, \tau) \le$

          $\le h^{-1} \{ 1 - \sum_{\nu = 0}^{N} P_{i\nu}(\tau - h, \tau) \} \to c_i(\tau) \{ 1 - \sum_{\nu = 0}^{N} p_{i\nu}(\tau) \}$.

In view of condition (9.5) the right side can be made arbitrarily small
by choosing $N$ sufficiently large. It follows that a termwise passage
to the limit in (9.12) is permitted and we obtain

(9.14)    $\frac{\partial P_{ik}(\tau, t)}{\partial \tau} = c_i(\tau)\, P_{ik}(\tau, t) - c_i(\tau) \sum_{\nu} p_{i\nu}(\tau)\, P_{\nu k}(\tau, t)$.

This, together with the initial condition (9.10), is the basic system of
backward differential equations.
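
For a time-homogeneous process with finitely many states the two systems can be compared numerically. The sketch below is only an illustration under assumptions not made in the text: three states, constant $c_j$ and $p_{jk}$ chosen arbitrarily, and a classical Runge-Kutta integrator. Writing $u = t - \tau$ and $q_{jk} = c_j p_{jk}$ for $j \ne k$, $q_{jj} = -c_j$, the forward system (9.9) becomes $P'(u) = P(u)Q$ and the backward system (9.14) becomes $P'(u) = QP(u)$, both with $P(0)$ equal to the identity matrix.

```python
def generator(c, p):
    """Matrix Q with q_jk = c_j * p_jk for j != k and q_jj = -c_j (constant coefficients)."""
    n = len(c)
    return [[-c[j] if j == k else c[j] * p[j][k] for k in range(n)] for j in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][m] * b[m][k] for m in range(n)) for k in range(n)]
            for i in range(n)]

def rk4(deriv, start, u, steps):
    """Classical Runge-Kutta integration of the matrix equation P' = deriv(P)."""
    n = len(start)
    h = u / steps

    def shifted(x, y, a):
        # Returns x + a * y entry by entry.
        return [[x[i][k] + a * y[i][k] for k in range(n)] for i in range(n)]

    p = [row[:] for row in start]
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv(shifted(p, k1, h / 2))
        k3 = deriv(shifted(p, k2, h / 2))
        k4 = deriv(shifted(p, k3, h))
        p = [[p[i][k] + h / 6 * (k1[i][k] + 2 * k2[i][k] + 2 * k3[i][k] + k4[i][k])
              for k in range(n)] for i in range(n)]
    return p

# Illustrative coefficients (assumed): rows of p add to 1 and p_jj = 0, as in (9.5).
c = [1.0, 2.0, 0.5]
p = [[0.0, 0.5, 0.5],
     [0.3, 0.0, 0.7],
     [1.0, 0.0, 0.0]]
Q = generator(c, p)
identity = [[1.0 if i == k else 0.0 for k in range(3)] for i in range(3)]

forward = rk4(lambda P: matmul(P, Q), identity, 1.0, 200)    # system (9.9)
backward = rk4(lambda P: matmul(Q, P), identity, 1.0, 200)   # system (9.14)
gap = max(abs(forward[i][k] - backward[i][k]) for i in range(3) for k in range(3))
print(gap, [sum(row) for row in forward])
```

With finitely many states both systems have the same solution and its rows add to unity; the phenomena discussed for infinite systems cannot be seen in such a truncated example.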

The two systems of differential equations were first derived by A.
Kolmogorov,$^{19}$ who laid the foundations of the theory of Markov
processes. It has been shown$^{20}$ that there always exists a common
solution $\{P_{ik}(\tau, t)\}$ of the two systems which satisfies the
Chapman-Kolmogorov equation (9.1) and


(9.15)    $P_{ik}(\tau, t) \ge 0, \qquad \sum_{k} P_{ik}(\tau, t) \le 1$.


We know from the pure birth process (section 4) that the
$P_{ik}(\tau, t)$ need not add to unity, the difference
$1 - \sum_k P_{ik}(\tau, t)$ accounting for the possibility of infinitely
many transitions within the finite time interval $(\tau, t)$. If
$\sum_k P_{ik}(\tau, t) = 1$, the solution $\{P_{ik}(\tau, t)\}$ is unique,
but in general different processes may satisfy the same forward and
backward equation (see section 10). From the point of view of
applications, the possibility of the inequality
$\sum_k P_{ik}(\tau, t) < 1$ may be safely disregarded.


19 Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung,
Mathematische Annalen, vol. 104 (1931), pp. 415-458.
20 See footnote 10.


Example. Generalized Poisson process. Consider the case where all
$c_i(t)$ equal the same constant, $c_i(t) = \lambda$, and the $p_{jk}$ are
independent of $t$. In this case the $p_{jk}$ are the transition
probabilities of an ordinary Markov chain and (as in chapter XV) we
denote its higher transition probabilities by $p_{jk}^{(n)}$.

From $c_i(t) = \lambda$, it follows that the probability of a transition
occurring during the interval $(t, t+h)$ is independent of the state of
the system at time $t$ and equals $\lambda h + o(h)$. This implies that the
number of transitions within the interval $(\tau, t)$ has a Poisson
distribution with parameter $\lambda(t - \tau)$. Given that exactly $n$
transitions occurred, the (conditional) probability of a passage from
$i$ to $k$ is $p_{ik}^{(n)}$. Hence


(9.16)    $P_{ik}(\tau, t) = e^{-\lambda(t - \tau)} \sum_{n=0}^{\infty} \frac{\lambda^n (t - \tau)^n}{n!}\, p_{ik}^{(n)}$


(where, as usual, $p_{ii}^{(0)} = 1$ and $p_{ik}^{(0)} = 0$ for $i \ne k$).
It is easily verified that (9.16) is in fact a solution of the two
systems (9.9) and (9.14) of differential equations satisfying the
boundary condition (9.10).

If, in particular,


(9.17)    $p_{jk} = 0$ for $k < j$, $\qquad p_{jk} = f_{k-j}$ for $k \ge j$


(9.16) reduces to the compound Poisson distribution of chapter XII,
section 1.
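
Formula (9.16) lends itself to direct computation. The following Python sketch evaluates the series for an assumed three-state chain $p_{jk}$ and an assumed rate $\lambda = 2$ (neither taken from the text), truncating the sum after a fixed number of terms, and then checks that the result satisfies the Chapman-Kolmogorov equation in the time-homogeneous form (9.2).

```python
import math

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][m] * b[m][k] for m in range(n)) for k in range(n)]
            for i in range(n)]

def generalized_poisson(p, lam, elapsed, terms=80):
    """Series (9.16) with t - tau = elapsed, truncated after `terms` terms."""
    n = len(p)
    power = [[1.0 if i == k else 0.0 for k in range(n)] for i in range(n)]  # p^(0)
    weight = math.exp(-lam * elapsed)       # Poisson weight for n = 0 transitions
    total = [[0.0] * n for _ in range(n)]
    for m in range(terms):
        for i in range(n):
            for k in range(n):
                total[i][k] += weight * power[i][k]
        power = matmul(power, p)             # next higher transition matrix p^(m+1)
        weight *= lam * elapsed / (m + 1)    # next Poisson weight
    return total

# Illustrative chain (assumed): rows add to 1 and p_jj = 0, as (9.5) requires.
p = [[0.0, 0.5, 0.5],
     [0.3, 0.0, 0.7],
     [1.0, 0.0, 0.0]]
lam = 2.0
short = generalized_poisson(p, lam, 0.8)
rest = generalized_poisson(p, lam, 0.5)
whole = generalized_poisson(p, lam, 1.3)
product = matmul(short, rest)
ck_gap = max(abs(whole[i][k] - product[i][k]) for i in range(3) for k in range(3))
row_sums = [sum(row) for row in whole]
print(ck_gap, row_sums)
```

The truncation point is a convenience of the sketch; with $\lambda(t - \tau)$ of moderate size the neglected Poisson weights are negligible.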


10. PROCESSES INVOLVING ESCAPES 


The example of the pure birth process (sections 3 and 4) proves that
the transition probabilities $P_{ik}(t)$ determined from the Kolmogorov
differential equations do not necessarily add to unity; it can happen that


(10.1)    $\sum_{k} P_{ik}(t) < 1$.


At the time of its discovery, in 1940, this phenomenon came as a
disturbing surprise. A huge literature has been devoted to it and the
related fact that the Kolmogorov differential equations do not always
determine a unique set of transition probabilities $P_{ik}(t)$. Processes
with these properties are usually called pathological.$^{21}$ With a better
understanding there came the realization that we are really confronted
with a simple and natural analogue to the familiar situation in diffusion
theory. The occurrence of (10.1) no longer appeared disturbing, but
led to the gratifying discovery that the theory of the Kolmogorov
differential equations shares the basic features of diffusion theory.
Despite the completely different appearance of the basic equations and
the analytical apparatus involved, we encounter in both theories the
same type of boundary conditions and other similarities; each theory is
better understood in the light of the other, and no sharp boundaries can
in fact be drawn. In this way the theory of Markov processes has
achieved an unexpected and pleasing internal unity.

Let us reconsider the simple pure birth process of section 3. The
system spends some time at the initial state $E_0$, moves from there to
$E_1$, stays for a while there, moves on to $E_2$, etc. The probability
$P_0(t)$ that the sojourn time in $E_0$ exceeds $t$ is obtained from (3.2)
as $P_0(t) = e^{-\lambda_0 t}$. This sojourn time, $T_0$, is a random
variable, but its range is the positive $t$-axis and therefore formally
out of bounds for this book. However, the step from a geometric
distribution to an exponential being trivial, we may with impunity
trespass a trifle. An approximation to $T_0$ by a discrete random
variable with a geometric distribution shows that it is natural to
define the expected sojourn time at $E_0$ by


(10.2)    $E(T_0) = \int_0^{\infty} t\, \lambda_0 e^{-\lambda_0 t}\, dt = \lambda_0^{-1}$.


At the moment when the system enters $E_j$, the state $E_j$ takes over the
role of the initial state and the same conclusion applies to the sojourn
time $T_j$ at $E_j$: The expected sojourn time at $E_j$ is
$E(T_j) = \lambda_j^{-1}$. It follows that
$\lambda_0^{-1} + \lambda_1^{-1} + \cdots + \lambda_n^{-1}$ is the expected
duration of the time it takes the system to pass through
$E_0, E_1, \ldots, E_n$, and we can restate the criterion of section 4 as
follows:

In order that $\sum P_n(t) = 1$ for all $t$ it is necessary and sufficient that


(10.3)    $\sum E(T_j) = \sum \lambda_j^{-1} = \infty$;

that is, the total expected duration of the time spent at
$E_0, E_1, E_2, \ldots$ must be infinite. Of course,
$L_0(t) = 1 - \sum P_n(t)$ is the probability that the system has gone
through all states before time $t$.


21 The counterpart of this section in the first edition was entitled
"degenerate processes."

In this form the theorem is extremely plausible. If the expected
sojourn time at $E_j$ is $2^{-j}$, the probability that the system has
passed through all states within time
$1 + 2^{-1} + 2^{-2} + \cdots = 2$ must be positive. Similarly, a particle
moving along the $x$-axis at an exponentially increasing velocity
traverses the entire axis in a finite time.
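
The plausibility argument can be illustrated by simulation. The Python sketch below assumes rates $\lambda_j = 2^j$ (so that the expected sojourn time at $E_j$ is $2^{-j}$), cuts the chain off after 40 states, and uses an arbitrary seed and sample size; these choices are made for the illustration only. It adds up the expected sojourn times and estimates by Monte Carlo the probability that all states are traversed within time 2.

```python
import random

def expected_passage_time(rates):
    """Sum of the expected sojourn times 1/lambda_j, as in (10.3)."""
    return sum(1.0 / lam for lam in rates)

def simulate_passage_time(rates, rng):
    """One realization of T_0 + T_1 + ... with independent exponential sojourn times."""
    return sum(rng.expovariate(lam) for lam in rates)

rates = [2.0 ** j for j in range(40)]   # truncation: the omitted tail has total mean 2**-39
rng = random.Random(12345)              # fixed seed, chosen arbitrarily
samples = [simulate_passage_time(rates, rng) for _ in range(20000)]

mean_time = sum(samples) / len(samples)
fraction_within_2 = sum(1 for x in samples if x <= 2.0) / len(samples)
print(expected_passage_time(rates))     # close to 2
print(mean_time, fraction_within_2)     # sample mean and estimated probability
```

The estimated fraction is strictly between 0 and 1, in agreement with the statement that passage through all states within a finite time has positive probability.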

If the birth process serves as a model of population growth, the
state $E_n$ stands for an actual population size $n$ and reaching infinity
in a finite time expresses a sort of explosion. In this connection (10.1)
represents indeed a singular anomaly, but for other applications it may
appear as a regular affair. Geometrically speaking, there is no reason
to place the states $E_0, E_1, E_2, \ldots$ at the points $0, 1, 2, \ldots$
of the $x$-axis. Imagine instead $E_n$ placed at the point $x_n$ of the
$x$-axis, where $0 = x_0 < x_1 < x_2 < \cdots$ and $x_n \to 1$. The birth
process may then be pictured as the motion of a "particle" starting at
$x_0 = 0$, jumping to $x_1$, jumping after a while to $x_2$, and so on. In
this picture, a particle which has passed through all states has
reached the limiting point 1; it is natural that this event can occur in
a finite time. Compare the probabilistic movement with a deterministic
motion of a particle starting at the origin and having at the place $x$
velocity $f(x)$. Its position $x(t)$ at time $t$ satisfies the
differential equation $x'(t) = f(x(t))$ and the time $\tau$ when the point
1 is reached is


(10.4)    $\tau = \int_0^1 \frac{dx}{f(x)}$.


Whether or not the point 1 is actually reached in a finite time (or only
asymptotically approached) depends on the convergence of the integral
over the reciprocal velocity. In the probabilistic model the motion
goes by jumps, but $\lambda_n^{-1}$ is the average time it takes to come
from $x_n$ to $x_{n+1}$. From this point of view (10.3) and (10.4) appear as
twin criteria.
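
The role of the integral (10.4) can be seen in a small numerical experiment. The Python sketch below uses two velocities chosen for illustration only, $f(x) = (1 - x)^{1/2}$ and $f(x) = 1 - x$, and approximates the travel time from 0 to $1 - \epsilon$ by the midpoint rule; the closed forms $2(1 - \epsilon^{1/2})$ and $\log(1/\epsilon)$ follow by elementary integration.

```python
import math

def travel_time(velocity, end, steps=100000):
    """Midpoint-rule approximation of the integral of dx / f(x) over (0, end)."""
    h = end / steps
    return sum(h / velocity((m + 0.5) * h) for m in range(steps))

def root_velocity(x):
    return math.sqrt(1.0 - x)      # integral (10.4) converges: total time 2

def linear_velocity(x):
    return 1.0 - x                 # integral (10.4) diverges like log(1/eps)

results = {}
for eps in (1e-2, 1e-3, 1e-4):
    results[eps] = (travel_time(root_velocity, 1.0 - eps),
                    travel_time(linear_velocity, 1.0 - eps))
    print(eps, results[eps])
```

As $\epsilon$ decreases the first column approaches 2, so the point 1 is reached in a finite time, while the second column grows without bound, so the point 1 is only approached asymptotically.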

Let us continue with the simple birth process and show how the
criterion (10.3) is related to the problem of uniqueness of the solutions
of the Kolmogorov differential equations.

Considered as transition probabilities the $P_n(t)$ of section 3 should
be written as $P_{0n}(t)$. The basic differential equations (3.2) apply
equally to $P_{in}(t)$ for an arbitrary (but fixed) $i$, and we have the
forward equations


(10.5)    $P'_{i0}(t) = -\lambda_0 P_{i0}(t), \qquad P'_{ik}(t) = -\lambda_k P_{ik}(t) + \lambda_{k-1} P_{i,k-1}(t)$


where $i \ge 0$ is fixed and $k = 1, 2, \ldots$. In (8.4) and (8.5) we have
the corresponding backward equations


(10.6)    $P'_{ik}(t) = -\lambda_i P_{ik}(t) + \lambda_i P_{i+1,k}(t)$

where $k \ge 0$ is fixed and $i = 0, 1, 2, \ldots$. The initial conditions
are the obvious ones:


(10.7)    $P_{ii}(0) = 1, \qquad P_{ik}(0) = 0, \quad k \ne i$.


A glance at (10.5) shows that $P_{i0}(t)$ is determined uniquely by the
first differential equation together with the boundary condition (10.7).
We can then calculate $P_{ik}(t)$ successively for $k = 1, 2, \ldots$. How
to solve a first order linear equation is well known, and we have the
easily verified formula for the unique solution of the forward equation
(10.5) with the initial condition (10.7):


(10.8)    $P_{ik}(t) = 0$ for $k < i$, $\qquad P_{ii}(t) = e^{-\lambda_i t}$,

          $P_{ik}(t) = \lambda_{k-1} \int_0^t e^{-\lambda_k s}\, P_{i,k-1}(t - s)\, ds \qquad (k > i)$.


The situation is completely different for the backward equations
(10.6). We shall show that when $\sum \lambda_n^{-1} < \infty$ there is no
uniqueness of the solution.
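
Because the forward equations (10.5) are recursive, the first few $P_{0k}(t)$ can be computed without reference to the higher states, and the defect in (10.1) becomes visible numerically. The Python sketch below is an illustration under assumed data: initial state $E_0$, the two rate sequences $\lambda_k = k + 1$ (for which $\sum \lambda_k^{-1} = \infty$) and $\lambda_k = (k + 1)^2$ (for which the series converges), 40 states, $t = 1$, and a classical Runge-Kutta integrator.

```python
import math

def birth_forward(rates, t, steps=2000):
    """Integrate (10.5) with i = 0 for the states 0, ..., len(rates) - 1 by RK4."""
    n = len(rates)

    def deriv(p):
        # P'_k = -lambda_k P_k + lambda_{k-1} P_{k-1}; the second term is absent for k = 0.
        return [-rates[k] * p[k] + (rates[k - 1] * p[k - 1] if k > 0 else 0.0)
                for k in range(n)]

    p = [1.0] + [0.0] * (n - 1)          # initial condition (10.7) with i = 0
    h = t / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([p[j] + h / 2 * k1[j] for j in range(n)])
        k3 = deriv([p[j] + h / 2 * k2[j] for j in range(n)])
        k4 = deriv([p[j] + h * k3[j] for j in range(n)])
        p = [p[j] + h / 6 * (k1[j] + 2 * k2[j] + 2 * k3[j] + k4[j]) for j in range(n)]
    return p

linear_rates = [k + 1.0 for k in range(40)]            # divergent series of 1/lambda_k
quadratic_rates = [(k + 1.0) ** 2 for k in range(40)]  # convergent series of 1/lambda_k

p_linear = birth_forward(linear_rates, 1.0)
p_quadratic = birth_forward(quadratic_rates, 1.0)
sum_linear = sum(p_linear)
sum_quadratic = sum(p_quadratic)
print(sum_linear, sum_quadratic)
```

The first sum is practically 1, while the second stays well below 1 no matter how many further states are included, since adding states can only bring in the probability of finite states not yet counted. The step size was chosen small enough for the largest rate in the sketch; larger rates would require smaller steps.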


Lemma. The unique solution $P_{ik}(t)$ of the forward equations (10.5)
given by (10.8) is automatically a solution of the backward equations
(10.6). If $\bar{P}_{ik}(t)$ is any non-negative solution of
(10.6)-(10.7) then


(10.9)    $\bar{P}_{ik}(t) \ge P_{ik}(t)$.


Proof. Consider (10.6) putting a bar over all $P_{ik}(t)$. The $i$th
equation may be solved as a linear differential equation for
$\bar{P}_{ik}(t)$ to obtain


(10.10)   $\bar{P}_{ik}(t) = \lambda_i \int_0^t e^{-\lambda_i s}\, \bar{P}_{i+1,k}(t - s)\, ds \qquad (k \ne i)$,

          $\bar{P}_{kk}(t) = e^{-\lambda_k t} + \lambda_k \int_0^t e^{-\lambda_k s}\, \bar{P}_{k+1,k}(t - s)\, ds$.

[Note that this is not a recursive system and cannot be used to solve
the system of equations (10.6).]

Let $P_{ik}(t)$ stand for the solution of the forward equations given by
(10.8). For each $k$ and $i > k$ put $\bar{P}_{ik}(t) = P_{ik}(t) = 0$.
These functions satisfy (10.10). Furthermore (10.10) defines
$\bar{P}_{kk}(t) = e^{-\lambda_k t} = P_{kk}(t)$. Letting in (10.10)
successively $i = k-1, k-2, \ldots$, we get $\bar{P}_{ik}(t)$ defined for
all $i$ and they are a solution of the backward equations (10.6) with the
initial conditions (10.7). Clearly
$\bar{P}_{k-1,k}(t) = P_{k-1,k}(t)$. We shall verify by induction that
$\bar{P}_{ik}(t) = P_{ik}(t)$. Suppose that this identity holds for all
combinations $i$, $k$ such that $k - i \le r$ where $r \ge 1$ is an integer
(we know this to be true for $r = 1$), and let $k - i = r + 1$. The
integral in (10.10) expresses $\bar{P}_{ik}$ in terms of
$\bar{P}_{i+1,k} = P_{i+1,k}$, and (10.8) in turn expresses $P_{i+1,k}$ as
an integral involving $P_{i+1,k-1} = \bar{P}_{i+1,k-1}$. We get thus
$\bar{P}_{ik}$ as a double integral over $\bar{P}_{i+1,k-1}$. Reversing the
order of integration we get


(10.11)   $\bar{P}_{ik}(t) = \lambda_{k-1} \lambda_i \int_0^t e^{-\lambda_k x}\, dx \int_0^{t-x} e^{-\lambda_i s}\, \bar{P}_{i+1,k-1}(t - x - s)\, ds =$

          $= \lambda_{k-1} \int_0^t e^{-\lambda_k x}\, \bar{P}_{i,k-1}(t - x)\, dx$.


By the induction hypothesis $\bar{P}_{i,k-1}(t) = P_{i,k-1}(t)$, and a
comparison of (10.8) and (10.11) proves that
$\bar{P}_{ik}(t) = P_{ik}(t)$ as asserted.

It remains to prove that (10.9) holds for an arbitrary solution
$\bar{P}_{ik}(t) \ge 0$ of the backward equations (10.6). Now both
$\bar{P}_{ik}(t)$ and $P_{ik}(t)$ satisfy (10.10). For $i > k$ we have
$\bar{P}_{ik}(t) \ge P_{ik}(t) = 0$. Letting successively
$i = k, k-1, k-2, \ldots$ we find that (10.9) holds for all $i$, $k$ and
the lemma is proved.

We can now sum up the situation in the following way. Two
contingencies can arise.

(a) The case $\sum \lambda_n^{-1} = \infty$. We know from section 4 that in
this case $\sum_k P_{ik}(t) = 1$. It follows that any other positive
solution of the backward equations necessarily adds to more than unity,
which is inadmissible. Accordingly, in this case we have the uniqueness
for the admissible solutions both of the forward and the backward
equations. The common solution represents the transition probabilities
of a birth process such that $\sum_k P_{ik}(t) = 1$. (It is easy to verify
by differentiation that the Chapman-Kolmogorov equation (9.1) holds.)

(b) The case $\sum \lambda_n^{-1} < \infty$. We know that in this case
$\sum_k P_{ik}(t) < 1$. Then


(10.12)   $L_i(t) = 1 - \sum_{k=0}^{\infty} P_{ik}(t)$

is the probability that, starting from $E_i$, "infinity" is reached before
time $t$.

We know that (10.10) is satisfied by $\bar{P}_{ik}(t) = P_{ik}(t)$ and by
summation we see that


(10.13)   $L_i(t) = \lambda_i \int_0^t e^{-\lambda_i s}\, L_{i+1}(t - s)\, ds$


or

(10.14)   $L'_i(t) = -\lambda_i L_i(t) + \lambda_i L_{i+1}(t), \qquad L_i(0) = 0$.

It follows that in this case the infinite system of differential equations
(10.14) has a non-zero solution $\{L_i(t)\}$ with $L_i(0) = 0$. With
arbitrary $A_k(t)$ the matrix


(10.15)   $\bar{P}_{ik}(t) = P_{ik}(t) + L_i(t)\, A_k(t)$


is a solution of the backward equations (10.6) satisfying the initial
conditions (10.7).

The question arises whether the $A_k(t)$ can be defined in such a way
that the $\bar{P}_{ik}(t)$ become transition probabilities satisfying the
Chapman-Kolmogorov equation (9.1). The answer is in the affirmative. We
refrain from proving this assertion but shall give a probabilistic
interpretation.

The $P_{ik}(t)$ define the so-called absorbing barrier process: When the
system reaches infinity, the process terminates. Doob$^{22}$ was the first
to study a return process in which, on reaching infinity, the system
instantaneously returns to $E_0$ (or some other prescribed state) and the
process starts from scratch. In such a process the system may pass from
$E_0$ to $E_5$ either in five steps or in infinitely many, having
completed one or several complete runs from $E_0$ to "infinity." The
transition probabilities of this process are of the form (10.15). They
satisfy the backward equations (10.6) but not the forward equations
(10.5).

This explains why in the derivation of the forward equations we were
forced to introduce the strange-looking assumption 3, which was
unnecessary for the backward equations: The probabilistically and
intuitively simple assumptions 1-2 are compatible with return processes,
for which the forward equations (10.5) do not hold. In other words,
if we start from the assumptions 1-2 then Kolmogorov's backward
equations are satisfied, but to the forward equations another term must
be added.$^{23}$

The pure birth process is admittedly too trite to be really interesting,
but the conditions as described are typical for the most general case of
the Kolmogorov equations. Two essentially new phenomena occur,
however. First, the birth process involves only one escape route out
to "infinity" or, in abstract terminology, a single boundary point. By
contrast, the general process may involve boundaries of a complicated
topological structure. Second, in the birth process the motion is
directed toward the boundary because only transitions
$E_n \to E_{n+1}$ are possible. Processes of a different type can be
constructed; for example, the direction may be reversed to obtain a
process in which only transitions $E_{n+1} \to E_n$ are possible. Such a
process can originate at the boundary instead of ending there. In the
birth and death process, transitions are possible in both directions
just as in one-dimensional diffusion. It turns out that in this case
there exist processes analogous to the elastic and reflecting barrier
processes of diffusion theory, but their description would lead beyond
the scope of this book.


22 J. L. Doob, Markoff chains - denumerable case, Transactions American
Mathematical Society, vol. 58 (1945), pp. 455-478.

23 Its form is given in the more recent paper cited in footnote 10, where
the various types of processes and the appropriate boundary conditions
are studied.


11. PROBLEMS FOR SOLUTION 


1. In the pure birth process defined by (3.2) let $\lambda_n > 0$ for all
$n$. Prove that for every fixed $n \ge 1$ the function $P_n(t)$ first
increases, then decreases to 0. If $t_n$ is the place of the maximum,
then $t_1 < t_2 < t_3 < \cdots$. Hint: Use induction; differentiate (3.2).

2. Continuation. If $\sum \lambda_n^{-1} = \infty$ show that
$t_n \to \infty$. Hint: If $t_n \to \tau$, then for fixed $t > \tau$ the
sequence $\lambda_n P_n(t)$ increases. Use (4.10).

3. The Yule process. Derive the mean and the variance of the distribution
defined by (3.4). [Use only the differential equations, not the explicit
form (3.5).]

4. Pure death process. Find the differential equations of a process of
the Yule type with transitions only from $E_n$ to $E_{n-1}$. Find the
distribution $P_n(t)$, its mean, and its variance, assuming that the
initial state is $E_i$.

5. Parking lots. In a parking lot with $N$ spaces the incoming traffic is
of the Poisson type with intensity $\lambda$, but only as long as empty
spaces are available. Find the appropriate differential equations for
the probabilities $P_n(t)$ of finding exactly $n$ spaces occupied.

6. In a waiting line the customer who came last is served first.$^{24}$
Find the appropriate differential equations for the probabilities
$P_n(t)$ that exactly $n$ newcomers will be served during the waiting time
of a customer picked at random.

7. The Polya process.$^{25}$ This is a non-stationary pure birth process
with $\lambda_n$ depending on time:

(11.1)    $\lambda_n(t) = \frac{1 + an}{1 + at}$.


Show that the solution with initial condition $P_0(0) = 1$ is


(11.2)    $P_0(t) = (1 + at)^{-1/a}$,

          $P_n(t) = \frac{(1 + a)(1 + 2a) \cdots \{1 + (n-1)a\}}{n!}\, t^n (1 + at)^{-n - 1/a}$.

Show from the differential equations that the mean and variance are $t$
and $t(1 + at)$, respectively.

8. Continuation. The Polya process can be obtained by a passage to the
limit from the Polya urn scheme, example V(2.c). If the state of the
system is defined as the number of red balls, then the transition
probability $E_k \to E_{k+1}$ at the $(n+1)$st drawing is

(11.3)    $p_{k,n} = \frac{r + kc}{r + b + nc} = \frac{p + k\gamma}{1 + n\gamma}$


where $p = r/(r + b)$, $\gamma = c/(r + b)$.

As in the passage from Bernoulli trials to the Poisson distribution, let
drawings be made at the rate of one in time $h$ and let $h \to 0$,
$n \to \infty$ so that $np \to t$, $n\gamma \to at$. Show that in the limit
(11.3) leads to (11.1). Show also that the Polya distribution V(2.3)
passes into (11.2).


24 E. Vaulot, Délais d'attente des appels téléphoniques dans l'ordre
inverse de leur arrivée, Comptes Rendus, Académie des Sciences, Paris,
vol. 238 (1954), pp. 1188-1189.

25 O. Lundberg, On random processes and their applications to sickness
and accident statistics, Uppsala, 1940.


9. Linear growth. If in the process defined by (5.7) $\lambda = \mu$, and
$P_1(0) = 1$, then


(11.4)    $P_0(t) = \frac{\lambda t}{1 + \lambda t}, \qquad P_n(t) = \frac{(\lambda t)^{n-1}}{(1 + \lambda t)^{n+1}}$.


The probability of ultimate extinction is 1.


10. Continuation. Assuming a trial solution to (5.7) of the form
$P_n(t) = A(t) B^n(t)$, prove that the solution with $P_1(0) = 1$ is


(11.5)    $P_0(t) = \mu B(t), \qquad P_n(t) = \{1 - \lambda B(t)\}\{1 - \mu B(t)\}\{\lambda B(t)\}^{n-1}$

with


(11.6)    $B(t) = \frac{1 - e^{(\lambda - \mu)t}}{\mu - \lambda e^{(\lambda - \mu)t}}$.

11. Continuation. The generating function $P(s, t) = \sum P_n(t) s^n$
satisfies the partial differential equation


(11.7)    $\frac{\partial P}{\partial t} = \{\mu - (\lambda + \mu)s + \lambda s^2\} \frac{\partial P}{\partial s}$.


12. Continuation. Let $M_2(t) = \sum n^2 P_n(t)$ and $M(t) = \sum n P_n(t)$
(as in section 5). Show that


(11.8)    $M_2'(t) = 2(\lambda - \mu) M_2(t) + (\lambda + \mu) M(t)$.

Deduce that when $\lambda > \mu$ the variance of $\{P_n(t)\}$ is given by

(11.9)    $e^{2(\lambda - \mu)t} \{1 - e^{(\mu - \lambda)t}\} (\lambda + \mu)/(\lambda - \mu)$.


13. For the process (7.2) the generating function
$P(s, t) = \sum P_n(t) s^n$ satisfies the partial differential equation

(11.10)   $\frac{\partial P}{\partial t} = (1 - s) \{ -\lambda P + \mu \frac{\partial P}{\partial s} \}$.

Its solution is


          $P = e^{-\lambda (1 - s)(1 - e^{-\mu t})/\mu}\, \{1 - (1 - s) e^{-\mu t}\}^{i}$.

For $i = 0$ this is a Poisson distribution with parameter
$\lambda(1 - e^{-\mu t})/\mu$. As $t \to \infty$, the distribution
$\{P_n(t)\}$ tends to a Poisson distribution with parameter
$\lambda/\mu$.


14. For the process defined by (7.26) the generating function
$P(s, t) = \sum P_n(t) s^n$ satisfies the partial differential equation


(11.11)   $(\mu + \lambda s) \frac{\partial P}{\partial s} = a \lambda P$,


with the solution $P = \{(\mu + \lambda s)/(\lambda + \mu)\}^{a}$.


15. In the "simplest trunking problem," example (7.a), let $Q_n(t)$ be the
probability that starting from $E_n$ the system will reach $E_0$ before
time $t$. Prove the validity of the differential equations


(11.12)   $Q_n'(t) = -(\lambda + n\mu) Q_n(t) + \lambda Q_{n+1}(t) + n\mu Q_{n-1}(t), \qquad (n \ge 2)$

          $Q_1'(t) = -(\lambda + \mu) Q_1(t) + \lambda Q_2(t) + \mu$


with the initial conditions $Q_n(0) = 0$.

16. Continuation. Consider the same problem for a process defined by an
arbitrary system of forward equations. Show that the $Q_n(t)$ satisfy the
corresponding backward equations (for fixed $k$) with $P_{0k}(t)$ replaced
by 1.

17. Show that the transition probabilities of the pure birth process and
those of the birth and death process satisfy the Chapman-Kolmogorov
equation (9.1).

18. Let $P_{ik}(t)$ satisfy the Chapman-Kolmogorov equation (9.1).
Supposing that $P_{ik}(t) \ge 0$ and that
$S_i(t) = \sum_k P_{ik}(t) \le 1$, prove that either $S_i(t) = 1$ for all
$t$ or $S_i(t) < 1$ for all $t$.

19. Ergodic properties. Consider a stationary process with finitely many
states; that is, suppose that the system of differential equations (9.9)
is finite and that the coefficients $c_j$ and $p_{jk}$ are constants.
Prove that the solutions are linear combinations of exponential terms
$e^{\lambda(t - \tau)}$ where the real part of $\lambda$ is negative unless
$\lambda = 0$. Conclude that the asymptotic behavior of the transition
probabilities is the same as in the case of finite Markov chains except
that the periodic case is impossible.

Answers to Problems 


CHAPTER I


1. (a) $; (6) 3; (c) x. 

2. The events $S_1$, $S_2$, $S_1 \cup S_2$, and $S_1 S_2$ contain,
respectively, 12, 12, 18, and 6 points.

4. The space contains the two points HH and TT with probability 1/4; the
two points HTT and THH with probability 1/8; and generally two points
with probability $2^{-n}$ when $n \ge 2$. These probabilities add to 1, so
that there is no necessity to consider the possibility of an unending
sequence of tosses. The required probabilities are 15/16 and 2/3,
respectively.

9. P{AB} = 34, P{A U B} = 38, P{AB’} = 3 


12. x = 0 in the events (a), (b), and (g).
x = 1 in the events (e) and (f).
x = 2 in the event (d).
x = 4 in the event (c).


15. (a) $A$; (b) $AB$; (c) $B \cup (AC)$.

16. Correct are (c), (d), (e), (5), ὦ), (ὦ, (ὦ), (ἢ. The statement (α) is mean- 
ingless unless C C B. It is in general false even in this case, but is correct in 
the special case CC B, AC = 0. The statement (Ὁ) is correct if C D AB. 
The statement (g) should read (A U B)—A=A’'B. Finally (4) is the cor- 
rect version of (7). 

17. (a) $AB'C'$; (b) $ABC'$; (c) $ABC$; (d) $A \cup B \cup C$;

(e) $AB \cup AC \cup BC$; (f) $AB'C' \cup A'BC' \cup A'B'C$;
(g) $ABC' \cup AB'C \cup A'BC = (AB \cup AC \cup BC) - ABC$;
(h) $A'B'C'$; (i) $(ABC)'$.
18. $A \cup B \cup C = A \cup (B - AB) \cup \{C - C(A \cup B)\} =$
$= A \cup BA' \cup CA'B'$.


CHAPTER II 


1. (a) $26^3$; (b) $26^2 + 26^3 = 18{,}252$; (c) $26^2 + 26^3 + 26^4$. In a
city with 20,000 inhabitants either some people have the same set of
initials or at least 1748 people have more than three initials.

2. $64 \cdot 14 = 896$. For a chess board with $n^2$ fields the formula is
$n^2(2n - 2)$.

3. $2(2^{10} - 1) = 2046$.


_ Un = mn + 1) ΙΝ 1 
3: (5) τῇ ὃ: SO ; (0) n(n — 1) 
6. (a) $p_1 = 0.01$, $p_2 = 0.27$, $p_3 = 0.72$.
(b) $p_1 = 0.001$, $p_2 = 0.063$, $p_3 = 0.432$, $p_4 = 0.504$.

7. $p_r = (10)_r 10^{-r}$. For example, $p_3 = 0.72$, $p_{10} = 0.00036288$.
Stirling's formula gives $p_{10} = 0.0003598 \ldots$.
" Ἢ ne (9/10)*; (Ὁ) (9/10)*; (c) (8/10)*; (ὦ) 2(9/10)* — (8/10)*; (e) AB and 


n 12 
μ,πτπ .-- = alg 
0, (7)» nN 5 10. 9 Φ ( 8 ) . 


11, The prope εὐ of exactly r trials is (n --- 1),-1 + (), =  η, 1. 
12, (a) 1/1-3-5 --- (Qn — 1) = 2m!/(2n)!; (Ὁ) (nl)/1- 8: . (Qn - ἡ) Ξ 


ΝΟΣ 


13. On the assumption of randomness the probability that all of twelve 
tickets come either on Tuesdays or Thursdays is (#)'* = 0.0000008 .... 


There are only (*) = 21 pairs of days, so that the probability remains ex- 


tremely small even for any two days. Hence it is reasonable to assume that 
the police have a system. 

14, Assuming randomness, the probability of the event is ($)!* = 4 appr. 
No safe conclusion is possible. 

15. (90)10 + (100)10 = 0.830476 .... 

16. 251(5!)-*5—-*> = 0.00209 .... 


17 a(n -- 2),(η --  -- }] 2(5 --τ -- ) 
; n! ~ n(n — ἢ 


.18. (a) xis; (δ) sBés- 

19. The probabilities are 1 — ($)* = 0.517747 ... and 1 — (3%)* = 
= 0.491404 .... 

20. (a) (n — N), + (n)y. (δ) (1 — N/n). For r = N = 8 the probabilities 
are (a) 0.911812 ...; (6) 0.912673 .... Forr = N = 10 they are (a) 0.330476; 
(b) 0.348678 .... 

21. (a) (1 — N/n)*—. (Ὁ) (n) wr + ((M)y)”. 

22. (1 — 2/n)*—*; for the median 27 1 = 0.7n, approximately. 

23. On the assumption of randomness, the probabilities that three or four 
breakages are caused (a) by one girl, (6) by the youngest girl are, respectively, 
ἐξ = 0.2 and 3% ~ 0.05. 


2 
24. (a) 121/122 = 0.000054. (0) (2 (28 — 2)12- = 0.00137 . 


801 (ΙΖ. ay _ 
25. 5 (,) 12 ~ 0.00035 .... 


2c (2) το τρυγῶ, 
0 (3) (R22) 2+ ὦ 
% ear Gene


ΕΣ (ων w= [2,(Νπ)}}. 


G(s 2) Gs) Gs) _ G) Gs) 


Maen) 1c) 


30. Cf. problem 29. The probability is 


et ee | Gas | or sen a try) rd | 
an (,) (we μ᾽ (5 
tern | Fete eee 
(7a) Gs) ὦ 
33. (a) 24p(5, 4, 3, 1); (b) 4p(4, 4, 4, 1); (c) 129(4, 4, 3, 2). 
ὦ [ἢ (ὃ (ὃ 
(i 


hand contains a cards of some suit, ὃ of another, etc.) 

35. por) = (52 — ra + (52)4; pi(r) = 4γ(52 — r)3 + (52)a; 
par) = Gr(r — 1)(52 — 7). + (52)4; 
por) = 4r(r — 1)(r — 2)(52 — r) + (52)4; pa(r) = (σὴς + (82). 

36. The probabilities that the waiting times for the first, ..., fourth ace 
exceed r are Wi(r) = po(r); wer) = polr) + pi(r); w3(r) = po(r) + pi(r) + po(r); 
ae = 1— par). Next fi(r) = wir) — wr + 1). The medians are 8, 20, 


we oa ak ων 
with k < 2; (b) [ῷ C= ) + (5. with k « 4. 


39. ond ie 40. (ara (τς + 1). 


7 


(Cf. problem 33 for the probability that the 


1, @ + re + τ 


τι ΠΣ 42. (49). + (52). 
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48. P{(7)} = 10-10- = 0,000 001. 
P{(6, Ὁ} = ἘΠΤῚ an 10-7 = 000063. 
P{ (5, 2)} = ἘΠΤῚ : πὶ - 10-7 = (000 189. 
P{(5, 1, 1)} = sant aa - 1077 = (001 512. 
P{(4, 3)} = τς sar 10-7 = (000 8318. 
P{(4, 2, 1)} = ae aan -10-7 = =_—«.007 560. 
P{(4, 1, 1, 1)} = a aaa 10-7 = _—O17 640. 
P{(3, 3, 1)} = “stl ἜΕΙ - 10-7 = (05 040. 
P{(3, 2, 2)} = soni "ΔΕ - 10-7 = (07 560. 
P{(3, 2, 1, 1)} = aan om -10-7 == .105 840. 
P{(3, 1, 1, 1, 1)} “ΒΕ “πα 

ake 5141} 1{{111|8! 
P{(2, 2, 2, 1)} = — oe -10-7 = =_~—«.052. 920. 
P{(2, 2, 1, 1, 1)} = sai — -10-77 == 317520. 
P{(2,1,1,1,1,)} = Teri oe 10-7 = .317 520. 
P{(1,1,1,1,1,1,1)} = am ay Altea 103 = (000 480. 


44. Letting S, D, T, Q stand for simple, double, triple, and quadruple, 
respectively, we have 


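$P\{22S\} = \frac{365!}{22!\,343!} \cdot 22! \cdot 365^{-22} = 0.524\,30$.
$P\{20S + 1D\} = \frac{365!}{20!\,1!\,344!} \cdot \frac{22!}{2!} \cdot 365^{-22} = 0.352\,08$.
$P\{18S + 2D\} = \frac{365!}{18!\,2!\,345!} \cdot \frac{22!}{2!\,2!} \cdot 365^{-22} = 0.096\,95$.
$P\{16S + 3D\} = \frac{365!}{16!\,3!\,346!} \cdot \frac{22!}{2!\,2!\,2!} \cdot 365^{-22} = 0.014\,29$.
$P\{19S + 1T\} = \frac{365!}{19!\,1!\,345!} \cdot \frac{22!}{3!} \cdot 365^{-22} = 0.006\,80$.
$P\{17S + 1D + 1T\} = \frac{365!}{17!\,1!\,1!\,346!} \cdot \frac{22!}{2!\,3!} \cdot 365^{-22} = 0.003\,36$.
$P\{14S + 4D\} = \frac{365!}{14!\,4!\,347!} \cdot \frac{22!}{2!\,2!\,2!\,2!} \cdot 365^{-22} = 0.001\,24$.
$P\{15S + 2D + 1T\} = \frac{365!}{15!\,2!\,1!\,347!} \cdot \frac{22!}{2!\,2!\,3!} \cdot 365^{-22} = 0.000\,66$.
$P\{18S + 1Q\} = \frac{365!}{18!\,1!\,346!} \cdot \frac{22!}{4!} \cdot 365^{-22} = 0.000\,09$.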

45. Let q = (*) = 2,598,960. The probabilities are: 
(a) 4/q; (6) 13-12-4-q7* = ayes; (ὦ) 13-12-4-6-q7! = ayes; 


12 
(ὦ 9-45-q7! = σύφῥεοι (0) 13 - (5) 4-42-91 = atts: 


12 : 
(f) (75) - 11:6:6.4:σπἰ = ἀβῥνν @) 18.- (5) - 6-4-7! = 6188. 
CHAPTER IV 


1. 99/323. 26 O21 his 3. 1/4. 4. 7/28, 
5. 1/81 and 31/6. 
6. If A; is the event that (k, k) does not appear, then from (1.5) 


35\" /6\ /34\", (ΘᾺ (38 
1 ~ v= 6 (55) ~ (3) Ge) + Ὁ) Ge) - ὦ Ga) +8 Ga) - Ge) 
52 48 13\ 44 
=) = . — bs = . 
7 Put p ι: Then δὲ 18 (9}Ρ; Se (3) Ge 
ὅς = 40. (2 -p. Numerically, Pio = 0.9658; Pi = 0.0341; Pio = 0.0001, 


approximately. 


8. ὦν, - Σοὶ () ( = ΄. 


x N —k), 
= >) (-- Ὁ} ( ᾿ Mi de, See [I(12.18) for a proof that the two 
k=0 
formulas agree. 
10. The general term is @1%,d2. -.. @new, Where (ki, ko, ..., kn) is @ permu- 
tation of (1,2, ...,.N). For a diagonal element k, = ν. 


ey n\ (ns — ks) r 
εὐ δ iy’ oO) (ns) » 
14. Note that, by definition, $u_r = 0$ for $r < n$ and $u_n = n!\, s^n/(ns)_n$.


7 5 n ee (ns = ἰδ). Ὁ 
15. Ur Ur~1 Pai 1) ΑΝ es (ns — 1), — ΤΡΈΜΕΙ 


“Σου ("7") 
k=0 k 
ΝᾺσ (N\ & ma (1 (kN 
6 (5) Cn) 2-9" (7) (2) 
52 4A\N “52 — 13k 
17. Use (5) =) 5): 
Proj = 0.264, Pry = 0.588, Pia} = 0.146, Pi3) = 0.002, approximately. 
52 4\ /52 — 2k 
roe Nee ia) on 5) (ς -- δι)" 


Pio) = 0.780217, Pty = 0.204606, Pio) = 0.014845, 
Pi3} = 0.000330, Pis; = 0.000002, approximately. 


r—l 


N—m 
19. m!Nlum = >) (—1)"N — m — k)I/kl. 
k=0 


20. Cf. the following formula with r = 2. 
21. (rN) lx = (3) (rN — 2)! — (3) r(rN — 3)!-+ —...4 
+ (~1)4*r*(rN — ΝῚ}. 
See τ τ 
τ τ em k r 
r 


25, Use II(12.16) and (124). 
26. Put $U_N = A_1 \cup \dots \cup A_N$ and note that $U_{N+1} = U_N \cup A_{N+1}$ and
$U_N A_{N+1} = (A_1 A_{N+1}) \cup \dots \cup (A_N A_{N+1})$.


24, Pim) = 


CHAPTER V 
(5)3 1 10-57 
1 πῶ τ  & p=l— apap = 081... 
3. (a) (: 9): Cc = 0.182 .... The probability of exactly one ace is 
35 
4. (: >) of (vs = 0.411.... (0) 1 — 0.182 — 0.411 = 0.407, approximately. 


as ἃ 
ae & = ἐδ 
428; 448; is 7. 99. 9. (ᾧ). 10.1-- 
12. —_ 13. (0) ἃ; (ὦ 2"-(1 + 2"). 


14. (d) Put an = tn — 5, bn = Yn — 4, Cn = 2n — ἤ. feats ee 
+ |eal = 3{|@n4i] + [Bn4i] + len4il}. Hence |an| + 
creases geometrically. 

15. $p = (1 - p_1)(1 - p_2)\cdots(1 - p_n)$.

16. Use $1 - x < e^{-x}$ for $0 < x < 1$ or Taylor's series for $\log(1 - x)$; cf.
II(8.12).


18. $\dfrac{b+c}{b+c+r}$.

19. If the statement is true for the nth drawing regardless of b, r, and c,
then the probability of black at the (n+1)st trial is

$$\frac{b}{b+r}\cdot\frac{b+c}{b+r+c} + \frac{r}{b+r}\cdot\frac{b}{b+r+c} = \frac{b}{b+r}.$$
20. The preceding problem states that the assertion is true for m = 1 and 
all n. For induction, consider the two possibilities at the first trial. 
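The induction in answers 19 and 20 can be illustrated by direct recursion. The following is a sketch under the assumption that the urn scheme is the usual one: a ball is drawn from an urn with b black and r red balls and is returned together with c additional balls of the same color. The function name `black_at_draw` is introduced here for illustration.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def black_at_draw(n, b, r, c):
    """Exact probability that the n-th ball drawn is black."""
    p_black = Fraction(b, b + r)
    if n == 1:
        return p_black
    # Condition on the color of the first ball, as in the induction step.
    return (p_black * black_at_draw(n - 1, b + c, r, c)
            + (1 - p_black) * black_at_draw(n - 1, b, r + c, c))

for n in range(1, 7):
    print(n, black_at_draw(n, 3, 5, 2))
```

For every n the recursion returns b/(b + r), in agreement with the displayed identity.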
23. Use II(12.9). 
26. From (5.2) ων = 2n(1 -- p) 3 
28. (a) u?; (Ὁ) u? + w + v?/4; (ὦ ᾿ + (25uv + 9v? + vw + 2uw)/16. 
88. Du = Pse = 2p = PD, Piz = P3s = 2es = 4, Pis = Pai = 0, Poe = ὅ. 


CHAPTER VI 
1. 5/16.  2. The probability is 0.02804 ....  3. $(0.9)^x \le 0.1$, $x \ge 22$.
4. $q^x < \tfrac12$ and $(1 - 4p)^x < \tfrac12$ with $p = \binom{48}{9} \div \binom{52}{13}$. Hence $x \ge 263$
and $x \ge 66$, respectively.


5. $1 - (0.8)^{10} - 2(0.8)^9 = 0.6242\ldots$.
6. $\{1 - (0.8)^{10} - 2(0.8)^9\}/\{1 - (0.8)^{10}\} = 0.6993\ldots$.
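Answers 5 and 6 are binomial computations and can be reproduced directly. The sketch below assumes, from the form of the expressions, ten Bernoulli trials with success probability 0.2, so that $(0.8)^{10}$ is the probability of no success and $2(0.8)^9 = 10(0.2)(0.8)^9$ that of exactly one.

```python
from math import comb

def binom_pmf(k, n, p):
    """b(k; n, p), the probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.2  # assumed parameters, read off the printed formulas
at_least_two = 1 - binom_pmf(0, n, p) - binom_pmf(1, n, p)
given_at_least_one = at_least_two / (1 - binom_pmf(0, n, p))

print(round(at_least_two, 4), round(given_at_least_one, 4))
```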


7. $\binom{26}{2}\binom{26}{11} \div \binom{52}{13} = 0.003954\ldots$, and $\binom{13}{2}\,2^{-13} = 0.00952\ldots$.
8. (5) {6 2.12-8}, 


9. True values: 0.6651 ..., 0.40187 ..., and 0.2009 ...; Poisson approxi-
mations: $1 - e^{-1} = 0.6321\ldots$, 0.3679 ..., and 0.1839 ....

10. $e^{-2}\sum_{k=4}^{\infty} 2^k/k! = 0.143\ldots$.  11. $e^{-1}\sum_{k=3}^{\infty} 1/k! = 0.080\ldots$.

12. $e^{-x/100} \le 0.05$ or $x \ge 300$.

13. $e^{-1} = 0.3679\ldots$, $1 - 2\cdot e^{-1} = 0.264\ldots$.
14. $e^{-x} \le 0.01$, $x \ge 5$.  15. $1/p = 649{,}740$.
16. $1 - p^n$ where $p = p(0; \lambda) + \dots + p(k; \lambda)$.
18. q*® for k = 0; pq’ for k = 1, 2, 3; and pq? — ρᾳβ for k = 4. 


19, >> ee es a) ot ee (= Ε fon large n. 
k—0 
a-+b—l a Ἔ ᾿ ἐν 
x ( 


k= 


20. ‘) piq?t’—!-*, This can be written in the alternative 


form ρα > { a ΤΡ ‘) q*, where the kth term equals the probability that 


the ath success occurs directly after k < 6 — 1 failures. 


εν ων 
N-1 ᾿ 


22. (α) c= > a aa aed > εὐ ν, τ, νὰ as: (b) Use IT(12.6). 


r=l 


21. x, = 


23. k; ~ NDi, ky ~~ NPi2 whence n~ kike/k1. 


n— 81 — 8;--1 
- (Ὁ). 25. στον 
i oi: g°rp T 


where 8; = my +...+ 7. 


25. p = p1go(Pige + peg)". 
31. By the Taylor expansion for the logarithm

$$b(0; n, p) = q^n = (1 - \lambda/n)^n < e^{-\lambda} = p(0; \lambda).$$


The terms of each distribution add to unity, and therefore it is impossible that 
all terms of one distribution should be greater than the corresponding terms 
of the other. 

32. There are only finitely many terms of the Poisson distribution which 
are greater than e, and the remaining ones dominate the corresponding terms 
of the binomial distribution. 


CHAPTER VII 


1. Proceed as in section 1. 2. Use (1.7). 3. Φί-- 33) = 0.148 .... 

4. 0.99. 5. 500. 6. 66,400. 

7. Most certainly. The inequalities of chapter VI suffice to show that an
excess of more than eight standard deviations is exceedingly improbable.


8. $(2\pi n)^{-1}\{p_1p_2(1 - p_1 - p_2)\}^{-1/2}$.
CHAPTER VIII 


1. β = 21.
2. 2 = pu+ q+ ru, where u, v, w are solutions of 
i a—l B-1 
w= pe + (qu + 10), υ = (pu + rw) <4 — 


w=putqt+rw=xz. 
net 1 — p*! 
3. ὦ = pe + (q+ rw) =~ 
-» 
1 -- αβ-ὶ ἘΠῚ 
v= (pu + τι) το“ - tb = (pie + 00) — 


4. Note that P{A,} < (2p)”, but 
P{A,} »͵ΠλΡ-πᾷᾳ — pp)?" > 1 - eT Onan, 


If p = i, the last quantity is ~}n; if p > 1, then P{A,} does not even tend 
to zero. 
CHAPTER IX 


1. The possible combinations are (0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0),
(2, 1), (3, 0). Their probabilities are 0.047539, 0.108883, 0.017850, 0.156364,
0.214197, 0.321295, 0.026775, 0.107098.

2. (a) The joint distribution takes on the form of a six-by-six matrix. The
main diagonal contains the elements $q, 2q, \ldots, 6q$ where $q = \frac{1}{36}$. On one
side of the main diagonal all elements are 0, on the other $q$. (b) E(X) =
Var(X) = #3, ἘΠῚ) = 4%, Var(¥Y) = ἐσ δ ξ, Cov(X, Y) = 772. 

3. In the joint distribution of X, Y the rows are $32^{-1}$ times (1, 0, 0, 0, 0, 0),
(0, 5, 4, 3, 2, 1), (0, 0, 6, 6, 3, 0), (0, 0, 0, 1, 0, 0); of X, Z: (1, 0, 0, 0, 0, 0), (0, 5,
6, 1, 0, 0), (0, 0, 4, 6, 1, 0), (0, 0, 0, 3, 2, 0), (0, 0, 0, 0, 2, 0), (0, 0, 0, 0, 0, 1); of
Y, Z: (1, 0, 0, 0), (0, 5, 6, 1), (0, 4, 7, 0), (0, 3, 2, 0), (0, 2, 0, 0), (0, 1, 0, 0). Dis-
tribution of X + Y: (1, 0, 5, 4, 9, 8, 5) all divided by 32, and the values of
X + Y ranging from 0 to 6; of XY: (1, 5, 4, 3, 8, 1, 6, 0, 3, 1) all divided by 32,
the values. ranging from 0 to 9. E(X) = 3, E(Y) = 3, E(Z) = 
Var(X) = %, γα) = ὃ, Var(Z) = 85. 

4. P{Z=i1,K Ξ 7) =q'tp® if i> 7 and = (1—q'**)q'’p if i= 7; no 
other values are possible. ΡΙΖ = 1} = δεῖ -- αἰ — qt p. 

8. The distribution of V,, is given by (8.5), that of U, follows by symmetry. 


ΛΞ =) 
‘or aa 

P{X=7r,Y=s} =N-{(r —s + 1)” — Ar — 8)" + (ἡ — 8 — 1)5}. 
if r>s, and =N™” if r=s. 


pn —2 —— (r a La 


9. PiX<7r, Y> s} = forr=> s; 


10. « = if 7<randk <r. 


(reat 
gn—2 
= -οο»»--.---...-- ----- 1 ΄« = = 
x ας - ἢ if jr and k=r, or j=r and k<r. 
x= 0 if j7>rork>r. 
11. o? = cl 


(n + 1)%(n +2) 
12. P{N = n,K =k} = () p"—*(qq')* gp". 
P{N = n} = (1 — φργ" a 
P{K =k} = (qr)'ap'= (—"*) (-2y = ρον 
K = αι = ee oe N\n—1 
BE (Gay = Lhpz.n/(n + 1) = g*p’¢ Σ( a) @ +a) 


qq’ p’p'¢' τ 
φ' (.-- py op 


ae 
ἊΣ .( -- φῇ͵ .- a 
E(K) = » ; EN) ap’ ; Cov(K, N) gp”? 


PAN) = |G mi 


14. py = ρίᾳ + op; E(X) = ρᾳ " +p; Var(X) = pq? + gp? — 2. 
15. φε = pq?! + cP; ΡΙΧ = m,Y =n} = p™tlgn + g™tp” with 
mn2>l; EY) =2; oF = Apqit+ gp — 1). 


17. () 83645-56δ1-- 


18. (a) $365\{1 - 364^n\cdot 365^{-n} - n\,364^{n-1}\cdot 365^{-n}\}$; (b) $n > 28$.

19. (a) $\mu = n$, $\sigma^2 = (n - 1)n$; (b) $\mu = (n + 1)/2$, $\sigma^2 = (n^2 - 1)/12$.
20. $E(X) = np_1$; $\mathrm{Var}(X) = np_1(1 - p_1)$; $\mathrm{Cov}(X, Y) = -np_1p_2$.
21. —n/36. This is a special case of 20. 
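The covariance in answers 20 and 21 can be confirmed by exhaustive enumeration for small n. This is a sketch assuming that problem 21 concerns n throws of a fair die, with X and Y counting the occurrences of two different faces (here the faces 1 and 6), so that $p_1 = p_2 = \frac16$ in the formula of answer 20; the function name `cov_two_faces` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def cov_two_faces(n, face_x=1, face_y=6):
    """Exact Cov(X, Y) for n throws of a fair die; X, Y count two distinct faces."""
    weight = Fraction(1, 6 ** n)
    ex = ey = exy = Fraction(0)
    for throws in product(range(1, 7), repeat=n):
        x, y = throws.count(face_x), throws.count(face_y)
        ex += weight * x
        ey += weight * y
        exy += weight * x * y
    return exy - ex * ey

for n in (1, 2, 3, 4):
    print(n, cov_two_faces(n), Fraction(-n, 36))
```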


Ν _SNN-r+h—1) 
25. E(Y,) = Sao cep Vert) = 2 Gara 


ee 


26. (Ὁ 1 -- φ; () Ἐ() -- Ν {1-- φ Ὁ 1}; (c) = 0. 


27. $\sum_j (1 - p_j)^n$. Put $X_j = 1$ or 0 according as the jth class is not or is
represented.


(Te + ri(r2 + 1) γι (71 — 1)(r2 -- 1) , 
eee ttre ae (σι + 72 — 1) + 12)? 
nbr{b + r+ nc} 


nb 
WS) Ἐπ ON eae aay 


33. E (=) = rk ( ᾿ ἢ po 
= Σ (—1)'-1 —— (2) ΕΞ (—*) r log p. 


k==1 


To derive the last formula from the first, put f(q) = r>k7 (’ Ν ᾿ q*. 


Using II(12.4), we find that f’(¢) = rg7—(1 — φ) τ’, The assertion now follows 
by repeated integrations by part. 


CHAPTER XI 
1. $sP(s)$ and $P(s^2)$.
2. (a) $(1-s)^{-1}P(s)$; (b) $(1-s)^{-1}sP(s)$; (c) $\{1 - sP(s)\}/(1 - s)$; (d)
$p_0s^{-1} + \{1 - s^{-1}P(s)\}/(1 - s)$; (e) $\tfrac12\{P(s^{1/2}) + P(-s^{1/2})\}$.
3. $U(s) = pqs^2/(1 - ps)(1 - qs)$. Mean $= 1/pq$, Var $= (1 - 3pq)/p^2q^2$.
6. A zero is the first, second, third, ... zero and therefore $U(s) = \sum F^k(s)$.
7. The generating function is $\{1 - F(s)\}(1 - s)^{-1} = (1 + s)U(s)$.
8. The generating function is {$F (s)}* = 2F(s) s~? — 1. 
9. Same generating function. 
10. The kth zero must occur at a trial number 2r < n and the ensuing 
n — 2r trials must not produce a zero. 
11. Use an obvious analogue to (1.6) for the case where P(1) < 1. 
12. Using the generating function for the geometric distribution of X, we 
have without computation 


mney ee (Fr ;) Εἰ = ~ 3) * a eee x): 
13. $P_r(s)\{N - (r - 1)s\} = P_{r-1}(s)(N - r + 1)s$.


Ν -- ΟἊδ -- 18 N —(N — 2)s N — (N -- γ)8 


15. 5, is the sum of r independent variables with a common geometric 
distribution. Hence 


P,(s) = (: a) Dra = Op oe oa ia ΝΣ 


16. P{R =r} = 5 PIS, = - ΕἸΡΙΧ,Σ pik) Ξ 


-Σ σφ (Ἐπ ‘et arg (ltrs? 


ν-- 1 


14. Ρ,(8) = 


ER)=1+5,  Var(R) = =. 


“ /k—-1 | | 
21. un = g” ἘΣ ( 9 ) p*q*—*un_~ With uo = 1, uw = 4, ue = @?, ug = 
=3 


= »" - φῆ, Using the fact that this recurrence relation is of the convolution 
type, 


᾿ς sal (ps)* 
U(s) = ice 5 + (1 — qs)8 U(s). 


22. $u_n = pw_{n-1} + qu_{n-1}$, $v_n = pu_{n-1} + qv_{n-1}$, $w_n = pv_{n-1} + qw_{n-1}$. Hence
$U(s) - 1 = psW(s) + qsU(s)$; $V(s) = psU(s) + qsV(s)$; $W(s) = psV(s) +
qsW(s)$.

CHAPTER XIII 


1. It suffices to show that for all roots $s \ne 1$ of $F(s) = 1$ we have $|s| \ge 1$,
and that $|s| = 1$ is possible only in the periodic case.


2 ᾿ : : 
2. Urn = [(ὕ 2-5] ~ (xn). Hence & is persistent only for r = 2. 


For r = 3 the tangent rule for numerical integration gives 


C~_/ —j —} -----.ὕ.ὕ -- mS ee 
Σ baci [ Ree (=) 2 
Hence by (3 .5) the probability of & ever occurring is, approximately, + = 3. 
A more precise evaluation of the sum is 0.47 and leads to x = 0.82. 

3. $S_n = 0$ is possible whenever $n = k(a + b)$, and the binomial distri-
bution shows that for such n we have $P\{S_n = 0\} \sim (a + b)^{1/2}(2\pi abk)^{-1/2}$. The
series diverges.

4. From 2f, + P{X: > 0} < 1 conclude that f < 1 unless P{X; > 0} = 0. 
In this case all X, < 0 and & occurs at the first trial or never. 

5. Let $u_1, \ldots, u_N$ be given and $u_n = p_1u_{n-1} + p_2u_{n-2} + \dots + p_Nu_{n-N}$ for
$n > N$. Then

$$\lim u_n = \frac{u_1p_N + u_2(p_N + p_{N-1}) + \dots + u_N(p_1 + p_2 + \dots + p_N)}{p_1 + 2p_2 + 3p_3 + \dots + Np_N}.$$

If $p_k = N^{-1}$ then $\lim u_n = 2N^{-1}(N+1)^{-1}(u_1 + 2u_2 + \dots + Nu_N)$.


6. Urls) = Σ + ἐσ»), - XG F)* = 5.9). 


7. Fs) = $s{1 + Fi(s)} = s~'F(s). 

8. U2(s) = {1 — F.(s)}-! = ξ + ξ( + 8)(1 — 525 τ-ὸ This shows the prob- 
ability of a first passage at tine 2n through a positwe point to equal 4 the 
probability of So, = 0. 

9. (a) F(s) = gs(l — ps’) 1,6 = 1+ ρα ', σῇ = rpq™; (Ὁ) Zn = n—Na, 
E(Z,) ~ npq(qt pr)—!, ατί(Ζ,) = nrpq(q + pr). 

10. U(s) = 1+ 4s 4. ἐν τῇ gts" 1 + g’s"(1 — 85). μ΄. = ἣῦ, 

11. N,* = (Ν, -- 714. 3)/22. 75; (4) — φί-- 2) = §. 

12. rp = Tn—1 — F%n—2 + Sfn—3 With ro = 71 = 72 = 1; 

R(s) = (8 + 252)(8 — 8s + 2s? — s%)—}; tn™~l 444248(1, 139680) οἰ, 

14. If $a_n$ is the probability that an A-run of length r occurs at the nth trial,
then A(s) is given by (7.5) with p replaced by a and q by 1 − a. Let B(s)
and C(s) be the corresponding functions for B- and C-runs. The required
generating functions are $F(s) = 1 - U^{-1}(s)$, where in case (a) U(s) = A(s);
in (b) U(s) = A(s) + B(s) − 1; in (c) U(s) = A(s) + B(s) + C(s) − 2.

15. Use a straightforward combination of the method in example (8.6) and 
problem 14. 

17. un = Np, υκ(ο) = Nog. 

19. Note that 1 — F(s) = (1 — s)Q(s) and μ — Q(s) = (1 — s)R(s), whence 
Q(1) = μι, 2Ν() = o? --τἰἰκκγρ +p. The power series for Q's) = (un — un—1)s” 
converges for s = 1. 


CHAPTER XIV 


1. The probability of ruin is still given by (2.4) with $p = \alpha(1 - \gamma)^{-1}$,
$q = \beta(1 - \gamma)^{-1}$. The expected duration of the game is $D_z(1 - \gamma)^{-1}$ with $D_z$
given by (3.4) or (3.5).

2. The boundary conditions (2.2) are replaced by go — ὅσι = 1 — δ, 4α = 0. 
To (2.4) there corresponds the solution 


4, = {(q/p)* — (q/p)*} (1 — δ) + {(@/p)? (1 — δ) + δᾳ» — 1}. 


The boundary conditions (3.2) become Dy = 6D,, Ὦ, -Ξ 0. 
3. To (2.1) there corresponds 45 = pqz+2 + Ω0,4--τ, and q, = A7 is ἃ particu- 
lar solution if \ = pA* + q, that is, if X = 1 or λξ +A = qp. The prob- 
ability of ruin is 
if gq > 2p 


1 
ffl, ni. 1)? ᾿ 
w= 1G+5) ᾿ a Ξ. 3». 
5. Wen4(t) = PWe4i,n(t) + qwz—i,n(x) with the boundary conditions (1) 


Wo, n({Z) = Wa,n(t) = 0 for n= 1; (2) wz,o(z) = 0 for 2 ¥ x and wz,o(x) = 1. 
6. Replace (1) by wo, A(z) = ταῦ and We,n(z) = Wa-1,n(2). 


10. P{M,, < 2) ΞΞ > ὑός n — Uzte n) 
z=1 


Pi{M, = z} = P{M, <z+1} — P{M, < 2}. 
11. The first passage through x must have occurred at a time k < n, and 
the particle returned from «x to z in the following n — & steps. 


CHAPTER XV 


1. P has rows (p, q, 0, 0), (0, 0, p, q), (p, q, 0, 0), and (0, 0, p, q). For n > 1
the rows are $(p^2, pq, pq, q^2)$.
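The statement that all rows coincide from the second step onward can be checked by multiplying the matrix out. The sketch below assumes the transition matrix with rows (p, q, 0, 0), (0, 0, p, q), (p, q, 0, 0), (0, 0, p, q) and picks p = 0.3 arbitrarily; the helper `matmul` is a plain-Python matrix product written for this example.

```python
def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

p = 0.3          # arbitrary illustrative value
q = 1 - p
P = [[p, q, 0, 0],
     [0, 0, p, q],
     [p, q, 0, 0],
     [0, 0, p, q]]

P2 = matmul(P, P)
P3 = matmul(P2, P)
common_row = [p * p, p * q, p * q, q * q]

for row in P2:
    print([round(x, 4) for x in row])
```

Every row of P2 and of P3 agrees with (p², pq, pq, q²) up to rounding.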

2. (a) The chain is irreducible and ergodic; p — + for 817, 8. (Note 
that Pi is doubly stochastic.) (6) The chain has period 3, with G; containing FE 
and Κῶ; the state EH, forms Gz, and H3 forms G3. We have τὼ = uz = 5, 
ug = us = 1. (ὁ) The states #, and £3 form a closed set δι, and £4, #; another 
closed set So, whereas Hy is transient. The matrices corresponding to the 
closed sets are two-by-two matrices with elements 3. Hence pe — ΜΕ; 
and Εἰ, belong to the same S,; pi? — 0; finally py? — ὦ if k = 1,8, and 
pak — 01} = 2,4,5. (ὦ The chain has period 3. Putting a = (0, 0, 0, 3, 
3, 1), b= (1, 0,0, 0,0, 0), c = (0,2, 4,0,0,0), we find that the rows of 


cpl + σε σα bob. 600 7, those of P= P&=.,. are ὃ, ©, 6, a, a, a, 
μοῦ of P= P4= ... arec, a, a, ὃ, ὃ, ὃ. 

3. $p_{jj}^{(n)} = (j/6)^n$, $p_{jk}^{(n)} = (k/6)^n - ((k - 1)/6)^n$ if $k > j$, and $p_{jk}^{(n)} = 0$ if
$k < j$.

4. τις = (ἢ, ἃ, 3,9), Yk = ay oy dy a) 


6. Put u = Znpn. The states are null states if uw = %. Stationary dis- 
tribution: uz, = μ (pe + Pega +-..). | 

7. Ergodic if ΣᾺ -- go)(1 — a) ... (1 — 4,..) < ©. Stationary distri- 
bution proportional to the terms of the series. 

8. ur = (p/g)"(q — p)/D. 

9. ur = {1 — p/a}(p/q)’* + {1 — (/9)%}. 


10. pj; = 2(N —9)/N*, ρῆῃμι =(N -- ὟΝ", Daz = P/N’, 


m= =) 


q Pp 0 

001 00 

0 00 

18. P| eke συν 
00 0 01 
14. Note that the matrix is doubly stochastic; use example (6.0). 

15. Put ρκ,κ..1 = 1 fork = 1, ...,N—1, and pyz = py. 

16. 2u;pj. = uz, then U(s) = μι! — s){P(s) — s}—. For ergodicity it is 
necessary and sufficient that P’(1) < 1. 

23. Let M be the maximum of z;. Consider the states ΕἾ, for which z, = M. 

26. If N > m — 2, the variables X and X™ are independent, and hence 
the three rows of the matrix pix’™ are identical with the distribution of ΧΟ», 
namely, (4,4,%). For n =m-+1 the three rows are (3, 3,0), (4, 4, 3), 
(0, 2) 2). 

CHAPTER XVII 


3. $E(X) = ie^{\lambda t}$; $\mathrm{Var}(X) = ie^{\lambda t}(e^{\lambda t} - 1)$.
4. $P_n'(t) = -\lambda nP_n(t) + \lambda(n + 1)P_{n+1}(t)$.


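$$P_n(t) = \binom{i}{n} e^{-i\lambda t}(e^{\lambda t} - 1)^{i-n} \qquad (n \le i).$$


$E(X) = ie^{-\lambda t}$; $\mathrm{Var}(X) = ie^{-\lambda t}(1 - e^{-\lambda t})$.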


5. $P_n'(t) = -(\lambda + n\mu)P_n(t) + \lambda P_{n-1}(t) + (n + 1)\mu P_{n+1}(t)$ for $n \le N - 1$
and $P_N'(t) = -N\mu P_N(t) + \lambda P_{N-1}(t)$.
6. Birth and deaths process with Δ, = X, ban = 
19. The standard method of solving linear differential equations leads to a 
system of linear equations. Cf. the hint contained in footnote 3 of chapter 
Index 


Absolute probabilities 106; —in Markov 
chains 349, 373. 

Absorbing barriers 311, 329, 341. 

Absorbing boundaries in stochastic proc- 
esses 433. 

Absorbing states 349, 362. 

Absorption probabilities: in birth and 
death process 409, 410; in diffusion 
327, 336; in Markov chains 362ff., 
378, 392ff.; in random walk 313, 329, 
335 (in generalized random walk 331). 
[Cf. Duration of game; Extinction.] 

Acceptance cf. Inspection sampling. 

Accidents: distribution of damages 270, 
398; models involving Bernoulli 
trials with variable probabilities 216; 
models with urns 109, 111; occu- 
pancy models 10; Poisson distribu- 
tion 147; statistics of bomb hits 150. 

ADLER, H. A., and K. W. MILLER 420.


Aftereffect, urn models for — 109. [Cf. 
Markov property.| 
Age distribution 294ff., 309; (example 


involving ages of a couple 13, 16). 
Aggregates, self-renewing 284, 294, 309. 
Aging, absence of — 305, 411. 
ANDERSEN, cf. SPARRE ANDERSEN, E. 
ANDRÉ, D. 70, 335.

Animal populations: recaptures 43;
trapping of — 160, 224, 269, 276.
Aperiodic (= not periodic) cf. Periodic.
Arc sine laws 77, 80, 86; counterpart 

72. 

Arrangements cf. Ballot problem; Occu- 
pancy problem; Ordering. 

Assignable causes 40. 

Atomic bomb 273. 

Average of a distr. = Expectation. 

Averages, moving 371, 379. 

Averaging, repeated 292, 308, 377. 


b(k; n, p) 187. 

BACHELIER, L. 323.

Backward equations 421, 427, 430, 436.

Bacteria counts 153.

BAILEY, N. T. J. 48.

Ballot problem 66, 70.

Balls in cells cf. Occupancy problem.

BANACH's match box problem 157, 212;
variants 160.

Barriers, classification of 312, 341.

BARTKY, W. 331, 346.

BATES, G. E. 267.

BAYES's rule 114.

BERNOULLI, D. 236. 

BERNOULLI, J. 135. 

BERNOULLI trials: definition 135; infinite 
sequences of — 183; number theo- 
retical interpretation 195. [Cf. Arc 
sine laws; Betting; Billiards; First 
passages; Random walk; Runs in 
Bernoulli trials;  etc.] 

BERNOULLI trials, multiple 158, 160, 223. 

BERNOULLI trials with variable proba-
bilities: definition 205; Poisson ap-
proximation 263; variance 216.

BERNSTEIN, S. 117, 178.

BERTRAND, J. 66. 

Beta function 168. 

Betting: ruin problem 313ff.; — in 
games with infinite expectation 235ff.; 
— on runs 183, 197, 303; —— systems 
185, 315; three players taking turns 
18, 24, 108, 130, 376. 

Billiards 265. 

Binomial coefficients 32, 48; identities 
for — 61, 85, 102; integrals for — 
325, 337. 

Binomial distribution 136; central term 
139, 182; —- combined with Poisson 
160, 269, 277; —- as conditional distr. 
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in Poisson process 223; convolutions 
of — 162, 252; expectation of — 209, 
252 (absolute expectation 226); gen- 
erating fct. 252; integral representa- 
tions of — 163, 323, 337; — as limit- 
ing form of hypergeometric distr. 57, 
161, and of Ehrenfest model 358; 
normal approximation to — 168ff.; 
Poisson approximation to — 142, 161, 
176 (numerical examples 98, 143, 
159); — in occupancy problems 34, 
98; tails of — 139, 163, 178, 181; 
variance 214, 216, 252. 

Binomial distribution, the negative cf. 
Negative binomial distr. 

Binomial formula 49. 

Birth and death process 366, 407ff., 435; 
for servicing problems 413, 434. 

Birth process 402, 422, 434; divergent — 
404, 429ff. 

Birthdays: duplications 31, 46, 57, 440; 
expected numbers 210, 224; — as 
occupancy problem 10, 58; Poisson 
distr. for — 94, 144; (combinatorial 
problems involving — 55, 159). 

Bivariate generating functions 261, 277,
309; — negative binomial 267;
— Poisson 162, 261.

BIZLEY, M. T. L. 66.

BLACKWELL, D. 286.

Blood counts 153; — tests 225.

BOLTZMANN-MAXWELL statistics 5, 21,
39, 57ff., 91, 103; — as limit for
Fermi-Dirac statistics 56.

BONFERRONI's inequalities 100, 131.

BOOLE's inequality 23.

BOREL, E. 191, 197.

BOREL-CANTELLI lemmas 188.

BOSE-EINSTEIN statistics 5, 20, 38ff., 59,
103.

BOTTEMA, O., and S. C. VAN VEEN
265.

Boundary points for stochastic processes 
433. 

Branching processes 272ff., 277. 

Breeding 132, 347, 376, 394. 

Bridge: ace distr. 11, 36, 55, 56, 158; 
composition of hands 33, 55, 56, 90, 
101; conditional probabilities 129; 
definition 8; waiting times 56; —
illustrating algebra of events 17, 24.
[Cf. Matching of cards; Shuffling.]
Brother-sister mating 132, 347, 376, 394. 
Brownian motion cf. Diffusion. 

Busy hour 400. 


Canonical decomposition of matrices 380. 

CANTELLI, F. P. 191; BOREL-CANTELLI
lemmas 188.

CANTOR, G. 18, 263.

Cards cf. Bridge; Matching of cards;
Poker; Shuffling.

CATCHESIDE, D. G. 54, 269; —, D. E.
LEA, and J. M. THODAY 102, 152.

Causes, probability of 114. 

Cells, distr. of balls in, cf. Occupancy 
problems. 

Centenarians 145. 

Center of gravity 214. 

Central force in diffusion 344. 

Central limit theorem: applications of — 
to combinatorial problems 180, 241; 
— to frequency of decimals 196; 
— to hypergeometric distr. 180; — 
to Poisson distr. 176, 180; — to 
recurrent events 297; — to random 
walks 325; — to runs 180, 300; for 
binomial distr. 173; for Markov 
chains 373; for sums of random vari- 
ables 229, 238ff., 245. 

Chain letters 55. 

Chain reactions 272ff., 277. 

Chains, random, length of 225. 

CHANDRASEKHAR, S. 377.

Channels cf. Counters.

CHAPMAN, D. G. 43.

CHAPMAN-KOLMOGOROV equation: for
Markov chains 370, 373; for stochas-
tic processes 424, 436.

Characteristic equation 332. 

Characteristic values for matrices 384. 

CHEBYSHEV, P. L. 219; — inequality 
219, 227. 

Chess problems 53, 101. 

Chi-square test: mentioned in connection 
with tabular material, but not defined. 

Chromosomes: breakages and inter-
changes of 54, 102, 151, 152, 161, 269;
explained 121ff.

CHUNG, K. L. 72, 77, 227, 286, 375.




CLARKE, R. D. 150. 

Classification, multiple 27. 

Closed sets, closures 349. 

COCHRAN, W. G. 41.

Coin tossing: as occupancy problem 11,
46; as random walk 73, 311ff.; distr.
of leads 68, 77ff.; empirical illustra-
tion 21, 83; ties in multiple — 289,
308. [Cf. BERNOULLI trials; First
passages; Random walk; Runs in
Bernoulli trials.]

Coincidences = matches 90, 97, 102. 

Collector’s problem 11, 46, 59, 102; 
moments 210, 224, 265. 

Colorblindness as sex-linked character 
126. 

Combinatorial problems: use — of cen- 
tral limit th. 180, 241; — of ran- 
domization 277. 

Combinatorial product 120. 

Combinatorial runs 40, 60, 225; normal 
distr. for — 180. 

Competition problem 175. 

Complementary events 15. 

Composite Markov processes (shuffling) 
372. 

Composition = convolution 250. 

Compound distributions 268; com- 
pounding the binomial and Poisson 
160, 269, 277. 

Compound Poisson distribution and 
process 270, 398, 428; negative bino- 
mial as compound Poisson distr. 271. 

Conditional distribution 204, 223; — ex-
pectation 209; — probability 104ff.

Confidence level 176. 

Configurations in occupancy problems 
37, 56. 

Connections to a wrong number 152. 

Contagion 111ff., 434. 

Continuity theorem 262. 

Convergence, almost everywhere and in 
measure 196, 243. 

Convolutions 250. 

Coordinate space 120. 

Correlation coefficient 221. 

Cosmic rays 11, 404. 

Counters 57; — of type I 279, 294, 308,
377; — of type II 308; waiting line
and servicing problems 413ff.




Coupon collecting 11, 46, 59, 102; mo- 
ments 210, 224, 265. 

Covariance 215, 222. 

CRAMÉR, H. 149.

Cumulative distr. function 168. 

Cycles 242. 

Cyclical random walk 343, 386. 

Cylindrical sets 120 


DAHLBERG, G. 129. 

Damage cf. Accidents; Radiation effects. 

Darwin, C. 69. 

Death process 434. 

Decimals: distribution of 195; — of e
and π 30, 59. [Cf. Random digits.]

Defective random variables 283. 

Defectives: inspection plans 158, 160, 
223, 331, 345 (blood tests 225); 
Poisson distr. for — 144; (elementary 
problems mentioning — 54, 130). 

“Degenerate processes’’ 404, 429. 

Delayed recurrent events 293, 352. 

DEMEREC, M. 269. 

DEMOIVRE, A. 168, 248, 266.

DEMOIVRE-LAPLACE limit theorem 172,
181.

Density fluctuations 377. [Cf. EHRENFEST model.]

Density function 168. 

Dependent cf. Independence; Stochastic. 

Derivatives, number of 37. 

DERMAN, C. 376.

Descendants: in birth and death process
402, 407; in branching processes
272ff., 277; breeding 132, 347, 376,
394; family relations of — 133;
— in population and renewal theory
295, 309; genetical models 121ff.,
240, 347.

Determinants (number of terms contain- 
ing diagonal elements) 101. 

Dice: ace runs 183, 197, 300; distr. of 
scores 201, 214, 229; equalization of 
ones, twos, ..., 281, 289; generating 
fct. 266; — as occupancy problem
11; Weldon’s data 138; (elemen- 
tary problems 36, 46, 54, 101, 129, 
158, 159, 179, 223, 376). 

Difference equations 314; method of
images 335; method of particular
solutions 314, 319, 332, 334; passage
to limit 324ff., 336, 435; several
dimensions 329, 337; special — : ab-
sorbing barriers 365; — Ehrenfest
model 358; — occupancy problem
58, 265; — Polya distr. 131; — re-
flecting barriers 376, 389; — renewal
equation 290.

Difference of events 16. 

Differential equations, Kolmogorov’s: 
backward 427; forward 426; special 
cases 401ff.; uniqueness 431. 

Diffusion 323; absorption and first 
passages 326, 336; Ehrenfest model 
111, 343, 358, 377; — coefficient 
325. 

DIRAC-FERMI statistics 5, 39, 56.

Discrete sample space 17. 

Dishes, test involving breakage of 55. 

Disorder in chance fluctuations 217. 

Dispersion = variance 213. 

Distinguishable 11, 20, 39; two kinds of
elements 34.

Distribution: conditional 204, 223; joint
— 200; marginal — 201; normal —
164; — function 168, 200; [— of
balls in cells cf. Occupancy problems].

DODGE's inspection plan 223.

DOEBLIN, W. 374.

DOMB, C. 277.

Dominant gene 122.

Domino 53.

DOOB, J. L. 186, 375, 433.

DORFMAN, R. 225.

Double Bernoulli trials 158, 160, 223. 

Double (bivariate) generating functions 
261. 

Double sampling 223, 331, 346. 

Doubling system 316. 

Doubly stochastic matrices 358. 

Drift 311, 325. 

Drugs, testing of 69, 139. 

Duality principle 70. 

Duration of games: in Markov chains 
378; in ruin problems 317ff., 334; 
in sequential sampling 330, 334. [Cf.
Extinction; First passages; Waiting
times.]

DVORETZKY, A., and T. MOTZKIN 66.




ὃ for recurrent events 278, 282. 

e, distr. of decimals 30, 59. 

Efficiency, tests of 69, 139. 

EGGENBERGER, F. 109. 

EHRENFEST, P. and T. 111, 343; —
model for heat exchange and diffusion
111, 343, 358, 377.

Eigenvalue, eigenvector 384.

EINSTEIN-BOSE statistics 20, 38ff., 59,
103.

EINSTEIN-WIENER diffusion 323.

EISENHART, C., and F. S. SWED 40.

Elastic barrier 312, 384, 341. [Cf. Ab-
sorbing barriers; Reflecting barriers.]

Elastic force in diffusion 344. 

Elevator problem 11, 31, 56, 440. 

Eurs, R. Εἰ. 323. 

Equilibrium, macroscopic or statistical 
356, 409. 

ERDÖS, P. 80, 198, 286.

Ergodic properties — of Markov chains 
356ff., 378, 395; — of stochastic proc- 
esses 409, 436. 

Ergodic states 353. 

ERLANG, A. K. 413; —'s loss formula
418.

Error function 168.

Escapes in stochastic processes 149ff.

Essential states 353.

Estimation, statistical, from: simple sam-
ples 176, 211, 223; — repeated sam-
pling 48; — independent observa-
tions 160.

Estimator, unbiased 227.

Events 8, 13ff.; compatible — 88; inde-
pendent — 117; simultaneous reali-
zation of — 16, 89, 96, 99; — in
product spaces 118ff.

Evolution 404.

Exclusive events 15.

Expectation 207; conditional — 209;
infinite — 249; — of products 208,
215, 221; — of reciprocals 224, 226,
227; — of sums 208.

Experiments, conceptual 4, 7ff.; com-
pound and repeated — 118.

Exponential distribution 399, 411, 429;
characterization by a functional equa-
tion 413.

Exponential holding times 305, 411.


Extinction: in birth and death process 
410; in branching processes 274; 
— of genes 124, 274, 365. 

Extrasensory perception (ESP) 54, 368. 


F for failure 135. 

Factorials 29; gamma fect. 63; Stirling’s 
formula 50, 64, 169. 

“Fair” games 233ff., 246, 315; — with
infinite expectations 236; unfavora-
ble — 235, 246.

Faltung = convolution 250.

Families: problems — on sex distr. 107,
108, 115, 130, 158, 269; — on dish-
washing 55.

Family names, survival of 273.

Family relations 133.

Family size, geometric distr. for 130, 274.

“Favorable” cases 23, 26.

FERMI-DIRAC statistics 5, 39, 56.

Fire cf. Accidents. 

Firing at targets 10, 159. 

First occurrence cf. Waiting times.

First passages in Bernoulli trials and
random walks 74, 280, 312; expecta-
tion 254, 317; explicit formulas 76,
322, 335ff.; generating fcts. 254, 308,
318, 335ff.; limit theorems 87, 326,
336. [Cf. Duration of games; Re-
turns; Waiting times.]

First-passage times: in diffusion 326,
335ff.; in Markov chains 352, 362
(expectation 378, 395); in stochastic
processes 436.

Fish catches 43. 

FISHER, R. A. 6, 44, 188, 274, 347;
—’s logarithmic distr. 269.

Fission 273.

Flaws in material 149, 159.

FOKKER-PLANCK equation 326.

Forward equations 422, 426, 431ff.

FRÉCHET, M. 88, 101, 375, 377, 380.

Frequency function 168.

FRIEDMAN, B. 109, 343.

FROBENIUS’ theory of matrices 375.

FRY, T. C. 188, 413.

FURRY, W. H. 404.

FÜRTH, R. 371; —’s formula 327.


G.-M. counters cf. Counters.
GALTON, F. 69, 241, 273.

Gambling systems 185, 315. 

Gamma function 63. 

Gaussian (= normal) distribution 168. 

Generating functions 248ff.; bivariate —
261, 277, 309; moment — 267, 277;
use of — for solving difference eqns.
318; — for differential eqns. of
stochastic processes 435; — for
Markov chains and matrices 380ff.

Genes and genotypes 10, 121ff., 182;
inheritance 240; Markov chains 347,
365; mutations and survival 274,
365, 403.

Geometric distribution 156, 223, 304ff.;
exponential limit 412; — for family
size 130, 274; lack of memory 304,
412; — as limit of Bose-Einstein
statistics 59; — as negative binomial
distr. 156, 210, 252; — in special
problems 48, 59, 223, 276; — in
stochastic processes 435.

GONČAROV, V. 243.

GREENWOOD, J. A., and E. E. Stuart 
54, 368. 

GREENWOOD, R. E. 59. 

Grouping in Markov chains 379. 

Grouping, tests of 40. 

Guessing 98, 217. 

GUMBEL, E. J. 145.


Haemophilia as sex-linked character 126.

HARDY, G. H. 124, 196; —’s law 124,
132.

HARRIS, T. E. 276, 376, 379.

HAUSDORFF, F. 191, 196.

Heat exchange, Ehrenfest model for 111,
343, 358, 377.

HELLY’s theorem 263.

Higher sums 370.

HODGES, J. L. 69, 72.

HOEFFDING, W. 217.

Holding times, exponential 305, 411.

Homogeneity, tests of 41.

HOSTINSKÝ, B. 375.

Hypergeometric distribution 41ff., 55, 56,
218; approximation of — by binomial
57, 161; — by Poisson 162; — by
normal 180; double — 45.
Hypothesis 105; statistical — cf. Tests,
statistical.


Images, method of 70, 335. 
Implication 16. 

Improper random variable 283. 
Incoming traffic 412. 


Independence, stochastic (= statistical)
114, 204, 227.

Indistinguishable 11, 20, 39; two kinds
of elements 34.

Inertia, moment of 214.

Infinite moments 249; limit theorems
for — 231, 239, 246, 298; (— in ran-
dom walks 83, 87, 326, 336).

Infinitely divisible distributions 271. 

Inheritance 121ff., 240. 

Initials 538. 

Insect litters and survivors 161, 269. 

Inspection sampling 42, 158, 160, 224; 
sequential — 330, 334, 345. 

Intersection of events 16. 

Inverse probabilities in Markov chains 
373. 

Inversions 241.

Irradiation, harmful 10, 54, 102, 152,
161, 269.

Irreducible chains 349.

ISING's lattice model 41.

Iterated logarithm, law of the 191, 196;
generalized — 197, 198; — for
Markov chains 374.


KAC, M. 54, 80, 111, 348, 391.
KAKUTANI, S., and K. YOSIDA 378.
KARLIN, S., and J. MCGREGOR 408.
KELVIN’s method of images 70, 335.
KENDALL, D. G. 269, 274, 410.
KENDALL, M. G., and B. SMITH 144.
Key problem 46, 54, 180, 224.
KHINTCHINE, A. 181, 192, 196, 229.
KOLMOGOROV, A. 6, 195, 286, 323, 353;
—'s criterion 243 (converse 247);
—’s differential equations 423ff.;
—’s inequality 220; Chapman-Kol-
mogorov equations 370, 373, 424, 436.
KOOPMAN, B. O. 4, 408.


Ladder points 280, 308. 
LAGRANGE, J. L. 266, 322.
LAPLACE, P. S. 168, 248, 377; —’s law
of succession 113; DeMoivre-Laplace 
limit theorem 172, 181. 

Largest observation, estimation from 211, 
228. 

Larvae 161, 269. 

Latent roots and vectors 384. 

Law of the arc sine 77, 80, 86; counter- 
part 72. 

Law of the iterated logarithm 191ff., 196; 
generalized — 197; — for Markov 
chains 374. 

Law of large numbers, the strong 243, 247;
— for Bernoulli trials 190, 196; — 
for Markov chains 374. 

Law of large numbers, the weak: for 
Bernoulli trials 141, 181, 198;  classi- 
cal forms 228, 238, 244ff.; for de- 
pendent variables 246; generalized 
form (for infinite moments) 236; for 
Markov chains 374; for permuta- 
tions 241; for recurrent events 297. 

Law of rare events or small numbers 149. 

Law of succession 113.

Leads, distribution of 67, 72, 77ff., 142; 
empirical illustration 83. 

LEDERMANN, W., and G. E. REuTER 
408. 

Lefthanders 159. 

Livy, P. 80, 271. 

Lightning, distribution of damage 270, 
398. 

LINDEBERG, J. W. 229, 239. 

LITTLEWOOD, J. E. 196.

Lyapunov, A. 229, 246. 

Logarithm, inequalities and series for 48. 

Long chain molecules 11, 225. 

Loss, coefficient of, 420. 

Loss formula, Erlang’s 418.

LOTKA, A. J. 180, 273.

Lunch counter example 40. 

LUNDBERG, O. 434. 


McCrea, W. H., and F. J. W. WHIPPLE 
327, 330. 

M’KENDRICK, A. G. 404.

MCGREGOR, J., and S. KARLIN 408.

Machine servicing 416ff. 

Macroscopic equilibrium 356, 409. 

MAIER-LEIBNITZ, H. 279.
MALÉCOT, G. 347.

MARBE, K. 136.

MARGENAU, H., and G. M. MURPHY 39.

Marginal distribution 201. 

Markov, A. 229, 338. 

Markov chains: associated with stochastic
processes 366, 378, 409, 428;
definition 340; mixtures of — 379; 
superposition of — 372. 

Markov chains of higher order 376. 

Markov process 368ff., 379; — with 
continuous time 397ff., 423ff. 

Markov property 305, 369. 

Match box problems 157, 160, 212. 

Matching of cards 90, 97, 102, 217. 

Mating (assortative and random) 122, 
132; brother-sister — 132, 347, 376, 
394. 

Matrix: canonical decomposition 380;
— notation 133, 348, 384;  parti- 
tioned — 351, 355, 392; stochastic 
— 340 (doubly stochastic — 358; 
non-stochastic — 374, 392ff.). 

Maxima in random walks: distribution 
335; position 86. [Cf. Largest observation.]

Maximum likelihood 44.

MAXWELL-BOLTZMANN statistics 5, 21,
39, 57ff., 91, 103; — as limit for 
Fermi-Dirac statistics 56. 

Mean, cf. Expectation. 

Median 48, 207. 

Memory in waiting times 305, 411. 

MENDEL, G. 121. 

Méré’s paradox 54.

MILLER, K. W., and H. A. ADLER 420.

Mises, R. von 6, 31, 94, 186, 191, 300, 
310. 

Misprints 11; Fermi-Dirac distr. for — 
39, 56; Poisson distr. for — 145, 159. 

Mixtures: of distributions 277; of
Markov chains 379; of populations 
1118. 

Molecules, long chain 11, 225. 

MOLINA, E. C. 145, 177.

Moment generating function 267, 277. 

Moment of inertia 214. 

Moments 213; infinite — 249. 

MONTMORT, P. R. 90.

MOOD, A. M. 180.
MORAN, P. A. P. 160, 161.

Morse code 53. 

Mortality 294ff., 309.

MOTZKIN, T., and A. DVORETZKY 66.

Moving averages 371, 379. 

Multinomial coefficients 35. 

Multinomial distribution: 157, 208, 224; 
generating fct. 261; maximal term 
161, 180. 

Multiple Bernoulli trials 158, 160, 223. 

Multiple classification 27.

Multiple Poisson distribution 162. 

Multiplets 27.

MURPHY, G. M., and H. MARGENAU 39.

Mutations 274, 404. 


(n)_r 28.

Negation 15. 

Negative binomial, bivariate 267. 

Negative binomial distribution 155, 210, 
252; — in birth process 404; infinite
divisibility 271; — as limit of Bose- 